Why normalized data is critical for best-in-class retrieval-augmented generation (RAG)

Customer data is essential to supporting most retrieval-augmented generation (RAG) use cases.

It allows SaaS products to power enterprise AI search, provide targeted lead recommendations, surface high-fit candidates, and more.

But customer data, in and of itself, isn’t enough to power your product’s RAG use case(s) effectively. This data also needs to be normalized, or standardized and transformed into a consistent format across multiple systems or platforms.

Employment type objects and full-time status fields across HRIS providers can be normalized, or transformed, into a common format
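Normalization like the one pictured above can be sketched in a few lines of Python. The provider names, raw field names, and values below are hypothetical stand-ins, not any real HRIS provider's API:

```python
# Hypothetical raw payloads from three HRIS providers -- each reports
# employment type under a different field name and value convention.
RAW_RECORDS = [
    {"provider": "hris_a", "emp_type": "FT"},
    {"provider": "hris_b", "employmentStatus": "Full-Time"},
    {"provider": "hris_c", "type": "full_time"},
]

# Per-provider field names and a shared vocabulary for the values.
FIELD_MAP = {"hris_a": "emp_type", "hris_b": "employmentStatus", "hris_c": "type"}
VALUE_MAP = {"FT": "FULL_TIME", "Full-Time": "FULL_TIME", "full_time": "FULL_TIME"}

def normalize(record: dict) -> dict:
    """Return the record in a common format, regardless of source provider."""
    field = FIELD_MAP[record["provider"]]
    return {"employment_type": VALUE_MAP[record[field]]}

# Every record now uses the same field name and value.
normalized = [normalize(r) for r in RAW_RECORDS]
```

However the raw payloads differ, downstream code (including your embedding step) only ever sees one field name and one value vocabulary.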

Read on to learn why normalized data is pivotal for RAG.

Generate more accurate outputs

When customer data is normalized, the data is presented in a consistent format and only includes the necessary information. As a result, the embedding algorithm can easily understand the semantic similarities (and differences) between different types of data. 

All of the data that’s semantically similar has embeddings (i.e., vectors) that are close to one another in the vector database, while non-normalized data can lead to embeddings that are more spread out in the database.
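"Close to one another" here means high cosine similarity between vectors. The toy 3-dimensional vectors below are hand-made for illustration (real embedding models produce hundreds or thousands of dimensions), but they show the idea: two consistently formatted records point in nearly the same direction, while an inconsistent one does not:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": the first two stand in for normalized, semantically
# similar records; the third for an unrelated or inconsistent record.
normalized_a = [0.9, 0.1, 0.0]
normalized_b = [0.85, 0.15, 0.05]
outlier      = [0.1, 0.2, 0.9]

similar_score  = cosine_similarity(normalized_a, normalized_b)
outlier_score  = cosine_similarity(normalized_a, outlier)
```

Retrieval then reduces to ranking stored vectors by this score against the embedded query.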

For example, say all of your employee objects and fields are normalized, including the “Manager” field. Here’s how this normalized data can support the retrieval portion of your RAG pipeline: 

1. A user makes a query in, say, your enterprise AI search, like “Who reports to John?”

2. The embedding algorithm would then transform the query’s plain text into a vector, or a numerical representation of the text’s meaning.

3. The embedded query would be compared to embeddings in the vector space, and all of the embeddings that are close to the embedded query (i.e., all of the employees who report to John) would be fetched.  

An example of how normalized data is close in proximity in a vector database
Embeddings related to John’s direct reports are similar to one another and to the embedded query, leading them to be retrieved as context for the LLM. Non-normalized embeddings would (as shown on the right) not be as similar to one another or to the embedded query, making them less likely to get retrieved

4. Both the embeddings that are fetched and the embedded query get fed to the LLM you use.

5. The LLM can then generate an accurate response (e.g., “Michelle, Michael, and Sammy report to John”).
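The retrieval steps above can be sketched end to end. The stored vectors and the query vector below are made-up placeholders for what an embedding model would actually produce:

```python
import math

def cosine(a, b):
    """Similarity score used to rank stored embeddings against the query."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Hypothetical vector database: text chunks mapped to pre-computed embeddings.
VECTOR_DB = {
    "Michelle reports to John": [0.90, 0.10],
    "Michael reports to John":  [0.88, 0.12],
    "Sammy reports to John":    [0.92, 0.08],
    "Ava reports to Priya":     [0.10, 0.90],
}

def retrieve(query_vector, k=3):
    """Step 3: rank stored embeddings by similarity and return the top k."""
    ranked = sorted(VECTOR_DB, key=lambda text: cosine(query_vector, VECTOR_DB[text]),
                    reverse=True)
    return ranked[:k]

# Step 2 (embedding "Who reports to John?") is stubbed with a made-up vector.
context = retrieve([0.9, 0.1])
# In steps 4-5, `context` plus the query would be passed to the LLM,
# which generates the final answer.
```

Note that the unrelated record about Priya's report is never retrieved, so it cannot leak into the LLM's answer.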


Prevent sensitive data from being retrieved

The process of normalizing data also gives you the opportunity to remove certain types of data. This gives you more control over the information that’s retrieved and used by the LLM.

As a result, you can avoid scenarios where the LLM would generate outputs with sensitive information on your business, employees, candidates, etc.

For example, say you’re normalizing customers’ employee data but you don’t want the LLM you use to retrieve their personally identifiable information (PII).

You can normalize the employee data such that all PII fields, like social security numbers, are removed before the data gets embedded and added to the vector database. 
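A minimal sketch of that PII-stripping step, run before any record is embedded. The field names here are illustrative:

```python
# Fields to drop during normalization, before embedding (illustrative names).
PII_FIELDS = {"ssn", "home_address", "date_of_birth"}

def strip_pii(record: dict) -> dict:
    """Return a copy of the record with all PII fields removed."""
    return {k: v for k, v in record.items() if k not in PII_FIELDS}

employee = {
    "name": "Michelle",
    "manager": "John",
    "ssn": "123-45-6789",  # never reaches the vector database
}
safe_record = strip_pii(employee)
```

Because the PII is dropped before embedding, there is no vector for the LLM to retrieve it from.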

An image that shows how data normalization can help you remove sensitive data
The process of normalizing data can include removing certain fields that are sensitive, like employees’ social security numbers

Avoid duplicate data in outputs 

Similar to the benefit above, the normalization process allows you to deduplicate data.

So, for example, if there are 5 duplicate employee records in a customer’s HRIS, normalizing them could remove 4 of those copies. 

Screenshot of how data normalization can help deduplicate data

This allows the LLM to retrieve just a single copy of the record and, in turn, only generate a single copy in its outputs.
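Deduplication during normalization can be as simple as keeping the first record seen for each stable key (here, a hypothetical employee ID):

```python
def deduplicate(records, key="employee_id"):
    """Keep only the first record per key value; drop subsequent duplicates."""
    seen = set()
    unique = []
    for record in records:
        if record[key] not in seen:
            seen.add(record[key])
            unique.append(record)
    return unique

# 5 duplicate copies of the same employee record, as in the example above.
records = [{"employee_id": "e1", "name": "Sammy"}] * 5
deduped = deduplicate(records)
# Only one copy survives to be embedded and retrieved.
```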


Leverage normalized data across your RAG use cases by using Merge

Merge, the leading unified API solution, lets you add hundreds of integrations through a single build. 

Merge also normalizes the data that’s synced from all these integrations according to its Common Models, or predefined data models. This allows you to access normalized data from all of your customers’ ticketing systems, HRISs, ATSs, CRMs, file storage solutions, and accounting systems.

An image that shows how Merge normalizes specific ticketing data
How Merge normalizes the ticket type object and completed status field across your customers’ ticketing systems
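Conceptually, this kind of cross-system status normalization looks like the sketch below. The provider values and common-model shape are hypothetical illustrations, not Merge's actual Common Model definitions or API:

```python
# Hypothetical raw "completed" status values from different ticketing systems.
STATUS_MAP = {
    "closed": "COMPLETED",
    "done": "COMPLETED",
    "resolved": "COMPLETED",
}

def to_common_ticket(raw_status: str) -> dict:
    """Map a provider-specific status onto one common-model value."""
    return {"status": STATUS_MAP.get(raw_status.lower(), "OPEN")}
```

With a unified API, this mapping is maintained for you across providers instead of being hand-written per integration.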

Merge also has advanced features to give you full control over the customer data you can and can’t sync and normalize, such as Scopes.

Learn more about how Merge can normalize your data and discover how companies use Merge to power their RAG use cases by scheduling a demo with one of our integration experts.
