March 28, 2023 By Hebert W. Pereyra and Jonathan Sloan
3 min read

The real-world challenges organizations face with big data today are multi-faceted. Every IT organization knows there is more raw data than ever before; data sources sit in multiple locations, in varying forms and of questionable quality. To add complexity, both the business users of data and the use cases for it have become more varied. The same data used for decision support and business intelligence might also be used to develop machine learning (ML) models. Semi-structured data, such as JSON and IoT log files, might need to be combined with transactional data to get a complete picture of a customer's buying experience, while emails and social media content might need to be interpreted for customer sentiment to enrich operational decisions, ML models or decision-support applications. Choosing the right technology for your organization can help solve many of these challenges.

Addressing these data integration issues might appear relatively easy: land all enterprise data in a centralized data lake and process it end to end. But that view is too simplistic. Real-time data must also be processed for decision support, the curated data inputs often reside in a data warehouse, and keeping data copies synchronized across the physical platforms that support Hadoop-based data lakes, data warehouses and data marts can be challenging.

Warehouses are known for high-performance processing of terabytes of structured data for business intelligence, but they can quickly become expensive for new, evolving workloads. When it comes to price-performance, the reality is that organizations are running data engineering pipelines and data science model-building workflows in data warehouses that are not optimized for scalability or for these demanding workloads, hurting pipeline performance and driving up costs. It is this web of data dependencies, requiring continuous movement of interdependent data sets across platforms, that makes these challenges so hard to solve.


Rethinking data analytics architecture

Software architects at vendors understand these challenges, and several companies have tried to address them in their own way. New workload requirements led to new functionality bolted onto software platforms that were never optimized for those workloads, reducing efficiency and deepening data silos within many organizations. In addition, each platform ends up holding overlapping copies of data, which complicates data management (data governance, privacy and security) and raises data storage costs.

For these reasons, the shortcomings of traditional data warehouse and data lake architectures have led businesses to operate complex environments, with data siloed and copied across data warehouses, data marts, data lakes and other relational databases throughout the organization. Given the prohibitive costs of high-performance on-premises and cloud data warehouses, and the performance challenges of legacy data lakes, neither repository satisfies the need for analytical flexibility and price-performance.

Instead of having each new technology solve the same problem in isolation, what is needed is a fresh architectural style.

Fortunately, the IT landscape is changing due to a mix of cloud computing platforms, open source and traditional software vendors. Cloud vendors, leading with object storage, have helped drive down the cost of storage. But data stored in object storage cannot readily be updated, and object storage alone does not offer the query performance business users have come to expect. Open-source table formats such as Apache Iceberg, combined with open-source engines such as Presto and Apache Spark, bring the economics of object storage together with better SQL performance and the ability to update large structured and semi-structured data sets in place. But there is still a gap to be filled that allows all these technologies to work together as a coordinated, integrated platform.
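To make the idea concrete, here is a minimal sketch (not from the article) of how Spark and Apache Iceberg can provide in-place updates over a data lake. The catalog name "lake", the sales.orders table, the warehouse path and the package version are illustrative assumptions; in production the warehouse would point at object storage (for example an s3a:// URI) with the appropriate credentials and connectors configured.

```python
# Hedged sketch: a Spark session with an Iceberg catalog, a table, and a
# row-level UPDATE -- the capability plain object storage lacks on its own.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("iceberg-lakehouse-sketch")
    # Illustrative Iceberg runtime version for Spark 3.4 / Scala 2.12.
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.4.0")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    # "lake" is a hypothetical catalog name; the local path stands in for an
    # object-storage warehouse such as s3a://example-bucket/warehouse.
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "file:///tmp/iceberg-warehouse")
    .getOrCreate()
)

# Create a namespace and an Iceberg table, then load a few sample rows.
spark.sql("CREATE NAMESPACE IF NOT EXISTS lake.sales")
spark.sql("""
    CREATE TABLE IF NOT EXISTS lake.sales.orders (
        order_id BIGINT, status STRING, amount DOUBLE
    ) USING iceberg
""")
spark.sql("INSERT INTO lake.sales.orders VALUES (1, 'OPEN', 99.5), (2, 'OPEN', 15.0)")

# Update structured data in place, something raw object storage does not support.
spark.sql("UPDATE lake.sales.orders SET status = 'SHIPPED' WHERE order_id = 1")
spark.sql("SELECT * FROM lake.sales.orders").show()
```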

To truly solve these challenges, query and reporting capabilities, provided by engines such as Presto, need to work alongside the Spark framework that supports advanced analytics and complex data transformations. And both Presto and Spark need to work readily with existing and modern data warehouse infrastructures.
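As an illustration of that division of labor, the hedged sketch below uses the presto-python-client package to run a reporting-style query against the same catalog and table assumed in the previous sketch, while Spark handles the heavy transformations. The coordinator host, port, user and connector name are hypothetical and depend entirely on how a given Presto deployment is configured.

```python
# Hedged sketch: a BI-style aggregate issued through Presto over the table the
# Spark pipeline maintains. Requires: pip install presto-python-client
import prestodb

conn = prestodb.dbapi.connect(
    host="localhost",   # hypothetical Presto coordinator endpoint
    port=8080,
    user="analyst",
    catalog="iceberg",  # assumes an Iceberg connector named "iceberg"
    schema="sales",
)
cur = conn.cursor()

# Reporting query over the shared table -- no data copied out of the lake.
cur.execute("""
    SELECT status, count(*) AS order_count, sum(amount) AS total_amount
    FROM orders
    GROUP BY status
""")
for status, order_count, total_amount in cur.fetchall():
    print(status, order_count, total_amount)
```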

The industry is waiting for a breakthrough approach that allows organizations to optimize their analytics ecosystem by selecting the right engine for the right workload at the right cost — without having to copy data to multiple platforms and while taking advantage of integrated metadata. Whichever vendor gets there first will allow organizations to reduce cost and complexity and drive the greatest return on investment from their analytics workloads while also helping to deliver better governance and data security.
