Westworld’s Big Data Lesson: to Move Forward, We Must Be Able to Look into the Past
In this special guest feature, Pedro Castillo, CEO and Founder of Logtrust, discusses how HBO’s popular Westworld is introducing Big Data and AI concepts to a whole new audience with each episode, and people are noticing. While technologists in Silicon Valley may understand why the Bernard character was able to check legacy data against known (or current) data to find anomalies, the average viewer may not. Pedro is a veteran of the security industry with over 15 years of experience in the IT and Financial Services sectors. His expertise in Big Data and Cloud Security is rooted in Spain’s stringent data privacy and protection requirements, under which he successfully deployed real-time big data-in-motion analytics in three major European industries (financial, telecommunications, cyber security). Prior to founding Logtrust in 2011, Pedro spent 11 years with Spanish bank Bankinter, first serving as the Technical Security Director and later as the New Technologies Director.
The popular HBO series Westworld introduced Big Data and AI concepts to a whole new audience, and people are noticing. It’s also teaching us important lessons that need to be taken into account by companies that are building their data analytics strategies.
Episode 6 in particular piqued viewers’ interest when a legacy data access issue marked an important development in the storyline. While technologists may understand why Bernard needed to check legacy data against current data to find anomalies, most also know (and may have experienced first-hand) why this is still a difficult exercise, and one that certainly cannot be completed in the short time frame the show depicts.
Vast networks of connected robots–what could possibly go wrong?
Companies, just like in Westworld, are tasked with managing a vast interconnected network of devices. We may not think of them as robots, but that’s essentially what they are. Consequently, organizations face a growing need to collect, visualize and analyze massive amounts of data at high speed. But, just as importantly, they must be able to contextualize it.
Westworld may be considered a next-gen example of the need for immediate correlation of streaming data with legacy data, one that’s really not that far off from the current reality. In order to detect an anomaly in any system, you have to be able to draw a solid baseline of the norm, which in some cases may require comparing current data with data that goes back decades. Going a step further, fixing the anomaly requires the ability to project it against a backdrop of legacy data sets that describe both past behavior and original design.
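To make the baseline idea concrete, here is a minimal sketch in Python. It assumes the simplest possible baseline, a mean and standard deviation computed from legacy readings, and flags live readings that stray too far from it. The function names, the sample readings, and the threshold of three standard deviations are all illustrative assumptions, not something taken from the show or from any particular product.

```python
from statistics import mean, stdev


def build_baseline(history):
    """Summarize legacy readings as (mean, sample standard deviation)."""
    return mean(history), stdev(history)


def find_anomalies(stream, baseline, threshold=3.0):
    """Return (index, value) pairs whose z-score against the baseline exceeds the threshold."""
    mu, sigma = baseline
    if sigma == 0:
        # A perfectly flat history gives no scale to measure deviation against.
        return []
    return [
        (i, x) for i, x in enumerate(stream)
        if abs(x - mu) / sigma > threshold
    ]


# Hypothetical data: years of "normal" readings versus what is arriving right now.
legacy_readings = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
live_readings = [20.0, 20.2, 27.5, 19.9]

baseline = build_baseline(legacy_readings)
anomalies = find_anomalies(live_readings, baseline)
print(anomalies)
```

A real baseline would usually be richer than this (seasonal, per device, per design revision), which is exactly why the depth of the historical record matters.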
To be sure, refrigerators and toasters may not kill or enslave anyone a la Westworld, but let’s imagine that something went wrong in the algorithms that allow a fleet of driverless cars to navigate around traffic. You can only imagine the chaos that could ensue, and you can envision teams of developers and engineers scrambling to figure out what went wrong by comparing current erratic behavior with past normal behavior. The speed at which they’re able to correlate data sets could determine whether lives are saved or lost.
Something like this could happen accidentally, as the result of faulty programming, or intentionally.
Consider the recent DDoS attack on Dyn that interrupted service for some of the world’s major online service providers and has left security experts scrambling to determine how to secure the Internet of Things. Right now there’s a lot of emphasis being placed on securing these devices. However, as external devices operating at the edge are always going to be tough to secure, we need to pursue with equal fervor the ability to rapidly analyze and stem suspicious behavior occurring in these networks.
Analyzing legacy and real-time data side-by-side is tougher than it sounds
Performing real-time analytics on streaming data while also comparing and correlating current data with historical (batch) and complex (multi-source) data sets, in order to deliver insights like the ones Bernard was searching for in Episode 6, often requires a combination of technologies. The data analytics structures for historical (batch) data fundamentally differ from those of streaming (real-time) data, and many companies are still trying to solve these problems.
Batch data processing involves high-volume transaction data collected over a period of time. In contrast, real-time data processing involves continual input, processing and output of data, with results available in near real time. Batch is the typical domain of Hadoop or a data warehouse, and when the goal is to reconcile real-time and batch processing for large data sets in Hadoop, you need a processing layer apart from MapReduce. Storm, Spark and SQLstream are often employed on top of Hadoop or data warehouses to help with real-time analytics.
Such attempts to take advantage of both batch- and stream-processing methods while balancing latency, throughput, and fault-tolerance are often referred to as the Lambda architecture. It may all sound good, but as batch and streaming analytics require different tool-sets with different code sets, and have development objectives that are somewhat at odds with each other (speed vs. completeness), they require a lot of developer expertise and time to put together. Assembling these in order to build a system may take months, or even years.
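The shape of a Lambda-style design can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the event format, the device names, and the `build_batch_view`, `SpeedLayer`, and `query` names are hypothetical, and in-memory counters stand in for what would in practice be a batch system and a stream processor.

```python
from collections import Counter


def build_batch_view(historical_events):
    """Batch layer: recompute complete per-device counts from stored history."""
    return Counter(event["device"] for event in historical_events)


class SpeedLayer:
    """Speed layer: incrementally count events that arrived after the last batch run."""

    def __init__(self):
        self.view = Counter()

    def ingest(self, event):
        self.view[event["device"]] += 1


def query(device, batch_view, speed_layer):
    """Serving layer: merge the complete-but-stale batch view with the fresh speed view."""
    return batch_view[device] + speed_layer.view[device]


# Hypothetical events: three already in long-term storage, two that just arrived.
historical = [{"device": "host-1"}, {"device": "host-1"}, {"device": "host-2"}]
batch_view = build_batch_view(historical)

speed = SpeedLayer()
for event in [{"device": "host-1"}, {"device": "host-3"}]:
    speed.ingest(event)

total_host1 = query("host-1", batch_view, speed)
print(total_host1)
```

Even in this toy version the same counting logic is written twice, once for each layer. That duplication, multiplied across real toolsets, is a large part of the development cost described above.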
Furthermore, streaming analytics in and of itself is challenging. According to a new survey from 451 Research, 53% of respondents said their technology wasn’t even capable of human real-time analytics, meaning that data isn’t available for analysis until at least five minutes after a specific event occurs. While five minutes may not sound like a long time, in many instances it could be enough time for chaos to ensue (and keep in mind that ‘five minutes’ is a best-case scenario for most companies; in the vast majority of cases, it takes much longer).
What it’ll take
The technical requirements of a system that allows you to analyze streaming data against legacy data aren’t easy to achieve, but there are a few basic things that organizations need. First, it requires the ability to maintain ultra-low response time, regardless of volume and retention time; you have to be able to recall data that was created a decade ago as fast as data that was created a split second ago. This means that, for the sake of these applications, we have to ditch the “hot/cold” storage paradigm. All the data that you might possibly need to analyze in a situation like this has to be hot all of the time.
Second, it requires the ability to massively scale, which means that for most companies these capabilities need to be built in the cloud in order to take advantage of the ability to “scale out” by adding server nodes to a cluster. Analytics applications will also need to run across clouds and on-premises platforms, as it’s unlikely that historical data will, for the foreseeable future, be collected in one place.
It also requires the ability to ingest, query, process, and analyze millions of events per day from sources that might range from network packets, IoT sensors, I/Os, memory, and VMs to AWS/S3, Azure, transactions, websites, social media platforms and more. Consequently, this will also involve unbounded data volumes with high variability data flows that arrive continuously and unpredictably.
It has to make sense to machines, because some of the challenges that companies face will actually need to be addressed faster than even Bernard can act on them. This means that companies need to build a system that triggers automatic alerts that notify very specific parties at the moment deviations occur, and triggers appropriate reactions from a library of responses.
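As a rough sketch of what "alerts plus a library of responses" might look like in code, the Python below maps deviation rules to the party to notify and to an automated reaction. Everything here is an illustrative assumption: the rule fields, the metric names and limits, the team names, and the `throttle` and `isolate` responses are hypothetical placeholders rather than a real product's API.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    metric: str    # which measurement the rule watches
    limit: float   # value above which the reading counts as a deviation
    notify: str    # hypothetical team to alert
    response: str  # key into the response library below


def throttle(event):
    """Placeholder response: pretend to rate-limit the offending source."""
    return f"throttled {event['source']}"


def isolate(event):
    """Placeholder response: pretend to cut the offending source off the network."""
    return f"isolated {event['source']}"


# The "library of responses": named, reusable automated reactions.
RESPONSES = {"throttle": throttle, "isolate": isolate}

RULES = [
    Rule("requests_per_sec", 1000, "network-ops", "throttle"),
    Rule("failed_logins", 50, "security", "isolate"),
]


def handle(event, rules=RULES, responses=RESPONSES):
    """Return (team_to_notify, action_taken) for every rule the event violates."""
    actions = []
    for rule in rules:
        if event["metric"] == rule.metric and event["value"] > rule.limit:
            actions.append((rule.notify, responses[rule.response](event)))
    return actions


alerts = handle({"source": "cam-17", "metric": "requests_per_sec", "value": 4200})
print(alerts)
```

Static limits are the simplest case. In a fuller system the limits themselves would be derived from the historical baseline rather than hard-coded.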
We won’t be able to automate every response, though, and the most serious problems will most likely require that such a system also make the data make sense to human beings. This means that you’ve got to have the ability to quickly generate visual analytics on disparate data sets, and then bring dashboards together in order to correlate them, detect deviations, and spot hidden data relationships in high-volume, high-velocity data streams. And, as dealing with these kinds of problems will likely require expertise from multiple departments (engineers and IT are going to have to collaborate), you’re going to need the ability to quickly share and collaborate on this information.
‘Westworld’ is playing out all around us
Sentient robots may not be on the brink of revolution, at least for now, but we’re seeing analogous scenarios play out in numerous areas. Consider how operational analytics is applied to an area like prognostics, where monitoring the vital health of components in real time for safety and reliability means combining on-board system telemetry with a holistic view of all systems and their history to analyze aging and probability of failure. Or think about security analytics, where companies must analyze historical data alongside recent data in real time, performing forensic operations that detect, hunt and counter hackers as they progress through the cyber kill chain.
Likewise, marketing departments increasingly hope to apply analytics to understanding how people are talking about a brand in real time, and to compare that with historical trends to create messages that resonate in the moment and over time. And of course there’s the much-talked-about example of IoT analytics, where comparing telemetry with historical data may enable companies to optimize vehicle fleets, track product on shelves to optimize delivery and supply chain, and react to changes in consumer demand as they occur. In many ways, Westworld is already upon us!