With every technology company shouting “Big Data,” we are led to think analytics challenges can be solved simply by storing a whole mess of data. With current technology, storing large volumes of data is easy. It also provides absolutely no value. Value only comes from data when it is examined, manipulated, learned from, and acted upon. To extract business insights you have to move data and exercise data (or is it exorcise?), and that isn’t easy at all.
With the current trend toward IT becoming more of a technology provider than a solution provider, it is increasingly the analyst’s responsibility to choose among technologies, techniques, and engines. Gaining value from ever-increasing volumes, varieties, and velocities of data is shining a spotlight on weaknesses technologies have always had: performance at scale and the ability to support concurrent workloads.
With the increasing need to consider entire data populations instead of just samples, data volumes are greater at every step of the analytic process. Not all engines and platforms perform equally. If you are using the right tools, even large, complex processes should run in a few minutes. If yours don’t, reconsider your approach to the solution and the technology you’re relying on.
In the current market, analysts are constantly pelted with buzzwords: AI, machine learning, deep learning, neural networks. Remember, it’s all just math, and computers, all computers, are really good at math. Some technologies want you to think they are the only choice for certain types of advanced analytics, but there is no test or technique that can only be done with a single language or on a single platform. Start where the data lies. Start with a simple approach on your fastest platform.
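The “it’s all just math” point can be made concrete: a single unit of a neural network is nothing more than a weighted sum pushed through a squashing function. A minimal sketch in plain Python, with purely illustrative weights and inputs (not from any real model):

```python
import math

def neuron(inputs, weights, bias):
    # Weighted sum of inputs plus bias -- ordinary arithmetic
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # Sigmoid squashes the result into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# Illustrative values only
score = neuron([0.5, 1.2], [0.8, -0.3], bias=0.1)
print(round(score, 3))  # a probability-like score between 0 and 1
```

Any engine that can do multiplication, addition, and exponentiation can compute this; the choice of platform is about performance and scale, not capability.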
More data means larger platforms and more sharing of resources. Concurrency has long been a vulnerability for many technologies, and that trend continues. Fancy platforms won’t hold up through the analytic lifecycle if they can only support a handful of users or concurrent processes. Almost every platform and engine hits a concurrency ceiling no matter how much compute power it has; what matters is where that ceiling is. This has been a consistent challenge for open source: because these projects are often not tested at scale, many open source bugs are concurrency related. Be aware and use appropriate caution.
As numbers people, it is easy for us analysts to get hung up on precision. In a business context, there is usually a clear preference for tolerating either false negatives or false positives, so you can intentionally set a lower precision threshold. Don’t over-engineer your solution. Choose the simplest and most performant solution that provides the business value you need. Basic statistics are still the best fit for binary data. Set logic is faster than procedural logic. Take advantage of parallel systems. You will be a hero as you rapidly show results. You will also produce processes that can actually be operationalized, which is a superpower in itself.
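The set-versus-procedural point is easy to demonstrate. Finding the customers common to two systems row by row costs roughly the product of the two sizes, while a set-based intersection costs roughly their sum. A small sketch with made-up ID lists:

```python
# Made-up example data: customer IDs from two systems
orders = list(range(0, 5_000, 2))    # even IDs
returns = list(range(0, 5_000, 3))   # multiples of 3

# Procedural: membership check against a list inside a loop, O(n * m)
slow = [c for c in orders if c in returns]  # fine here, painful at scale

# Set logic: one hash-based intersection, roughly O(n + m)
fast = set(orders) & set(returns)

assert set(slow) == fast  # same answer, very different cost curve
```

The same principle is why a set-based SQL join will usually beat a row-at-a-time cursor: the engine can hash, sort, and parallelize the whole operation instead of repeating work per row.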
Most importantly, don’t overlook the value of concluding that a path isn’t showing results and that it’s time to move on to the next challenge.