Integrating Data Engineering and Generative AI: A World of Digital Intersection


Data engineering, the practical application of data collection and analysis, involves gathering, storing, processing, and transforming data into a format that humans (analysts) and machines (algorithms) can use to make decisions and drive action.

Examples of data engineering applications include e-commerce product recommendations based on user activity, health insights from wearable devices, and smart city traffic management using real-time sensor data. Data engineers are typically responsible for the entire data management lifecycle, including discovering the data, choosing the systems, implementing these systems, and creating the data pipelines.

Generative AI, on the other hand, refers to machine learning models that generate content. These models are trained to produce new data or content on behalf of humans, and their output mimics or resembles the data they were trained on. Examples include statistical data forecasts, customer service chatbot responses, summarized business presentations, and graphical designs.

Data engineering continues to be a typical function found in many organizations.1 Experts at McKinsey also recently found that one-third of all surveyed organizations already regularly use generative AI in at least one function, and 40% of those reporting AI adoption expect to invest more in AI overall.2

Mutual dependence on data

Data engineering and generative AI both depend heavily on data, and the success of each rests on the data’s quality and accessibility. Typically, the data engineer ensures the quality of the data used throughout the organization by other people, systems, and algorithms.

Data engineers prepare and organize data and feed it to AI models. Generative AI then analyzes and presents the data in a way people can understand. Both functions create a bridge between systems and humans.

To illustrate this intersection, consider a customer service scenario in which a data engineer prepares and maintains customer interaction data and feeds it into a generative AI model. The model, which was likely trained on similar data captured previously, generates human-like responses to customer inquiries. The goal is to provide a more engaging, effective, and streamlined customer service experience.
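The preparation step in this scenario can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production pipeline: the `Interaction` record, the `prepare` and `build_prompt` helpers, and the prompt layout are all hypothetical names invented for this example, and the email pattern is deliberately simple. The resulting prompt string would be passed to whichever generative AI model the organization uses; no specific model API is assumed here.

```python
import re
from dataclasses import dataclass


@dataclass
class Interaction:
    """Hypothetical record for one customer message."""
    customer_id: str
    channel: str
    text: str


# Deliberately simple email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def prepare(records):
    """Normalize whitespace, mask email addresses, and drop
    empty or duplicate messages."""
    seen = set()
    cleaned = []
    for r in records:
        text = " ".join(r.text.split())      # collapse runs of whitespace
        if not text:
            continue                          # skip empty messages
        text = EMAIL_RE.sub("[EMAIL]", text)  # mask email addresses
        key = (r.customer_id, text.lower())
        if key in seen:
            continue                          # skip case-insensitive duplicates
        seen.add(key)
        cleaned.append(Interaction(r.customer_id, r.channel, text))
    return cleaned


def build_prompt(history, question):
    """Assemble cleaned history and a new question into prompt text."""
    lines = [f"[{r.channel}] {r.text}" for r in history]
    return (
        "Previous interactions:\n"
        + "\n".join(lines)
        + f"\n\nCustomer question: {question}\nAnswer:"
    )


raw = [
    Interaction("c1", "email", "  My order   is late. Contact me at jo@example.com "),
    Interaction("c1", "chat", "my order is late. contact me at jo@example.com"),
    Interaction("c1", "chat", "   "),
]
history = prepare(raw)
prompt = build_prompt(history, "Where is my order?")
```

Masking and deduplication happen before the prompt is built, so the model never sees the raw address or the repeated message.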

Another example comes from financial services, where users prompt a custom generative AI interface that draws on data such as internal analyst reports, annual shareholder reports, earnings call transcripts, and internal investment notices. This information is used to answer questions about future investment opportunities and risk analysis. Generative AI can also help reduce the workload of repetitive tasks, allowing workers to focus on more strategic areas.3

McKinsey researchers estimate that generative AI could add the equivalent of $2.6 trillion to $4.4 trillion annually across the 63 use cases they analyzed. That estimate could double when factoring in generative AI embedded into software that is currently used for other tasks beyond those use cases.4

Broad categories of value capture from generative AI include:5

  • Cost reduction: Lower costs, primarily through automation and job substitutions
  • Process efficiency: Automate and reduce manual tasks
  • Growth: Increase revenue through hyper-personalized marketing for target customers
  • Accelerating innovation: Deliver new products or services faster with speedier go-to-market
  • Discovery and insights: Uncover new ideas, insights, and questions to unleash creativity

Challenges in integrating data engineering and generative AI

As with any advanced technology that pushes the limits of what’s possible, there are potential extreme outcomes and inherent risks. The intersection of data engineering and generative AI is no different.

For those following the trajectory of generative AI, limitations and risks include hallucination, bias, lack of human reasoning, and limited context window.6

However, several additional challenges may arise as data engineering and generative AI integrate into formal business processes. According to a recent study, generative AI raises numerous ethical issues, including manipulation and the ability to deceive users, copyright abuses, and lack of accountability, along with risks related to safety, robustness, fairness, transparency, and environmental impact. Security, privacy, and related regulatory requirements also add complexity to the mix.7

Here’s a summary of the main challenges:

  • Data privacy: In many cases, the data used for generative AI training and operations includes sensitive information, which raises privacy concerns regarding regulation and ethics.
  • Regulatory compliance: With regulations like GDPR, CCPA, and the California Delete Act, organizations must be purposeful in how they collect, store, and use data with their generative AI models.
  • Model transparency: Not being able to explain “black box” generative AI models can be problematic because it’s crucial to understand why models generated an output or made a specific prediction.
  • Ethical use: Generating realistic images, voice, or text that is difficult to distinguish from authentic human-created content can have profound ethical and legal implications for organizations.
  • Data access: Restrictions or policy limitations on using specific datasets could hamper generative AI models’ development, training, and effectiveness.
  • Data quality: Generative AI success directly correlates to the quality of the training data it uses—biased, incomplete, outdated, or otherwise flawed data can produce inaccurate or misleading outputs.
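The data quality point above lends itself to a small automated check. The sketch below is one possible approach, assuming records arrive as plain Python dictionaries; the `quality_report` function, its parameters, and the field names are hypothetical and chosen only for illustration. It counts incomplete, outdated, and duplicated records. It does not attempt to detect bias, which generally needs domain-specific analysis.

```python
from datetime import date


def quality_report(rows, required, date_field, today, max_age_days):
    """Count incomplete, stale, and duplicate records in a dataset.

    rows: list of dicts; required: field names that must be non-empty;
    date_field: field holding a datetime.date (or None if unknown).
    """
    total = len(rows)
    # A record is incomplete if any required field is missing or empty.
    incomplete = sum(
        1 for r in rows
        if any(r.get(f) in (None, "") for f in required)
    )
    # A record is stale if its date is older than max_age_days.
    stale = sum(
        1 for r in rows
        if r.get(date_field) is not None
        and (today - r[date_field]).days > max_age_days
    )
    # Exact duplicates: records whose fields and values all match.
    unique = {
        tuple(sorted((k, str(v)) for k, v in r.items())) for r in rows
    }
    return {
        "total": total,
        "incomplete": incomplete,
        "stale": stale,
        "duplicates": total - len(unique),
    }


rows = [
    {"id": 1, "text": "hi", "updated": date(2024, 1, 1)},
    {"id": 1, "text": "hi", "updated": date(2024, 1, 1)},
    {"id": 2, "text": "", "updated": date(2020, 1, 1)},
    {"id": 3, "text": "ok", "updated": None},
]
report = quality_report(
    rows,
    required=["id", "text"],
    date_field="updated",
    today=date(2024, 1, 31),
    max_age_days=365,
)
```

A report like this could gate a training or ingestion job, for example by refusing to proceed when any count exceeds a threshold the team has agreed on.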

Addressing each of these challenges requires careful planning. Robust data governance and an ethical framework for the use of AI are also paramount. Organizations must work closely with legal, ethics, and data management experts when integrating data engineering and generative AI.

In the next post, we will look at how platform engineering can help accelerate the time to value for AI-driven transformations while alleviating some of these issues.

  1. McKinsey, The State of Organizations 2023, May 2023
  2. McKinsey, The state of AI in 2023: Generative AI’s Breakout Year, July 2023
  3. Google Cloud, The Prompt: Choosing Generative AI Use Cases, April 2023
  4. McKinsey, The Economic Potential of Generative AI, June 2023
  5. Deloitte, The Generative AI Dossier, September 2023
  6. Deloitte, Generative AI is All the Rage, April 2023
  7. NTT Data, Ethical Considerations of Generative AI, June 2023