Metric prompt templates for model-based evaluation

This page lists the templates you can use for model-based evaluation with the Gen AI Evaluation Service. For more information about model-based metrics, see Define your own metrics.

Overview

In a model-based evaluation, a prompt is sent to the judge model, which generates the metric score based on the specified criteria, the rating rubric, and any other instructions.

The following table gives an overview of the example metric prompt templates that are available:

	Text use case	Multi-turn chat use case	Other key use cases
Pointwise	Fluency, Coherence, Groundedness, Safety, Instruction following, Verbosity, Text quality	Multi-turn chat quality, Multi-turn chat safety	Summarization quality, Question answering quality
Pairwise	Fluency, Coherence, Groundedness, Safety, Instruction following, Verbosity, Text quality	Multi-turn chat quality, Multi-turn chat safety	Summarization quality, Question answering quality

Structure a metric prompt template

A metric prompt template must include the following main sections:

Instruction
Evaluation
User inputs and AI-generated response

Each section can contain subsections.
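
As a rough illustration of this three-section layout, the sketch below assembles a template string from its parts. The helper name `build_metric_prompt` and its parameters are hypothetical and are not part of the Gen AI Evaluation Service; the heading text mirrors the example templates shown later on this page.

```python
def build_metric_prompt(instruction, evaluation, input_variables):
    """Assemble the three main sections into one template string.

    instruction: text for the "# Instruction" section.
    evaluation: dict mapping a subsection heading (for example "Criteria")
        to its body text, in the order it should appear.
    input_variables: names of user inputs other than the response.
    """
    lines = ["# Instruction", instruction, "", "# Evaluation"]
    for heading, body in evaluation.items():
        lines += [f"## {heading}", body, ""]
    lines += ["# User Inputs and AI-generated Response", "## User Inputs"]
    for name in input_variables:
        # Each input becomes a subsection holding a {placeholder}.
        lines += [f"### {name.capitalize()}", "{" + name + "}", ""]
    lines += ["## AI-generated Response", "{response}"]
    return "\n".join(lines)


template = build_metric_prompt(
    instruction="You are an expert evaluator. Your task is to evaluate "
    "the quality of the responses generated by AI models.",
    evaluation={
        "Metric Definition": "You will be assessing summarization quality, "
        "which measures the overall ability to summarize text.",
        "Criteria": "Fluency: The response is well-organized and easy to read.",
    },
    input_variables=["prompt"],
)
print(template)
```

Keeping each subsection as a separate entry makes it easier to add or drop optional parts, such as few-shot examples or evaluation steps, without editing one long string by hand.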

Instruction

Component	Function	Type	Example
Instruction	Includes a persona for the judge model and a brief description of the task.	Default value	You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models. We will provide you with the user input and AI-generated responses. You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below. You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

Evaluation

Component	Function	Type	Example
Metric definition	Specifies the name and definition of the metric.	Optional user input	`You will be assessing a metric called SummarizationQuality, which measures the overall ability to summarize text`
Criteria	Defines the criteria (and, optionally, sub-criteria) for the metric.	Required user input	`Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements. Groundedness: The response contains information included only in the context. The response does not reference any outside information.`
Rating rubric	Specifies the scoring scale for the metric, with an explanation of what each score means.	Required user input	`5: (Very good). The summary follows instructions, is grounded, is concise, and fluent. 4: (Good). The summary follows instructions, is grounded, concise, and fluent. 3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent. 2: (Bad). The summary is grounded, but does not follow the instructions. 1: (Very bad). The summary is not grounded.`
Few-shot examples	Examples of the task.	Optional user input. Note: few-shot examples improve not only performance but also the formatting of the judge model's response. We suggest starting with five to ten few-shot examples.	`RESPONSE: Purple monkeys jumped onto the submarine while Beethoven's Fifth Symphony played loudly and the chef cooked spaghetti with meatballs. EXPLANATION: The provided response is a single sentence lacking any discernible structure or connections between the ideas presented. There's no logical flow to assess, no organization, and the juxtaposition of elements (monkeys, submarine, symphony, spaghetti) creates jarring incoherence. SCORE: 1`
Evaluation steps	Step-by-step instructions on how to carry out the task.	Optional user input. Note: you can specify criteria rankings in the evaluation steps.	`STEP 1: Assess the response in aspects of instruction following, groundedness, helpfulness, and verbosity according to the criteria. STEP 2: Score based on the rubrics.`

User inputs

Component	Function	Type	Example
Input variables	The inputs that users must provide to complete the autorater prompt and receive a response.	Required user input	`## User Inputs ### Prompt {prompt} ## AI-generated Response {response}`

In addition, if the column names in your data don't match the input variables and you don't want to rename your data, provide a mapping:

Component	Function	Type	Example
Metric column mapping	A mapping from the input variables in the prompt to the columns in the user data.	Optional user input. Note: `prompt`, `response`, and `baseline_model_response` don't support mapping if `evaluate()` runs model inference.	`metric_column_mapping = {"reference":"ground_truth"}`
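
To make the idea of a column mapping concrete, here is a minimal sketch in plain Python. It only illustrates how a mapping such as `{"reference": "ground_truth"}` connects an input variable to a differently named column; it does not show how the Gen AI Evaluation Service applies the mapping internally. The helper `fill_template` and the sample row are hypothetical.

```python
from typing import Mapping, Optional


def fill_template(
    template: str,
    row: Mapping[str, str],
    metric_column_mapping: Optional[Mapping[str, str]] = None,
) -> str:
    """Fill template placeholders from one dataset row.

    metric_column_mapping maps an input variable name in the template to
    the dataset column that holds its value.
    """
    values = dict(row)
    for variable, column in (metric_column_mapping or {}).items():
        values[variable] = row[column]
    # str.format ignores keyword arguments that the template does not use.
    return template.format(**values)


user_inputs = (
    "## User Inputs\n### Reference\n{reference}\n\n### Prompt\n{prompt}\n\n"
    "## AI-generated Response\n{response}"
)
# Hypothetical row: the reference text lives in a column named "ground_truth".
row = {
    "prompt": "Summarize the article.",
    "response": "A short summary.",
    "ground_truth": "The reference summary.",
}
filled = fill_template(user_inputs, row, {"reference": "ground_truth"})
print(filled)
```

Without the mapping, the `{reference}` placeholder would have no matching column and the call would raise a `KeyError`, which is the mismatch the mapping is meant to resolve.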

Tailor a metric prompt template to your input data

To tailor a template to your specific data and evaluation criteria, follow these steps:

Identify missing criteria: determine which criteria the current template doesn't adequately address.
Add new criteria: include the missing criteria in the prompt, clearly defining what you expect the judge model to consider.
Adjust user input fields: to use additional columns from your evaluation dataset in the evaluation, add them to the user input fields and instruct the judge model on how to use them.
Update the rating rubric: modify the rating rubric to reflect the new criteria and their relative importance.

For example, to evaluate a summarization model on how well the response summary aligns with a reference summary, add a new criterion called "reference alignment" and include the reference data as part of User Inputs:

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
You will be assessing summarization quality, which measures the overall ability to summarize text.

## Criteria
Instruction following: The response demonstrates a clear understanding of the summarization task instructions, satisfying all of the instruction's requirements.
Groundedness: The response contains information included only in the context. The response does not reference any outside information.
Conciseness: The response summarizes the relevant details in the original text without a significant loss in key information without being too verbose or terse.
Fluency: The response is well-organized and easy to read.
Reference alignment: The response is consistent and aligned with the reference response.

## Rating Rubric
5: (Very good). The summary follows instructions, is grounded, concise, fluent and aligned with reference summary.
4: (Good). The summary follows instructions, is grounded, concise, and fluent but not aligned with reference summary.
3: (Ok). The summary mostly follows instructions, is grounded, but is not very concise and is not fluent and is not aligned with reference summary.
2: (Bad). The summary is grounded, but does not follow the instructions.
1: (Very bad). The summary is not grounded.

## Evaluation Steps
STEP 1: Assess the response in aspects of instruction following, groundedness, conciseness, fluency and reference alignment according to the criteria.
STEP 2: Score based on the rubric.

# User Inputs and AI-generated Response
## User Inputs
### Reference
{reference}

### Prompt
{prompt}

## AI-generated Response
{response}
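
When a template gains a new input field such as `{reference}`, it can help to confirm that the evaluation dataset has a matching column before running anything. The following sketch uses Python's standard `string.Formatter` to list the placeholders in a template; the function names and the column list are hypothetical, and the check is a local convenience rather than a feature of the service.

```python
import string


def template_variables(template):
    """Return the set of {placeholder} names found in a template."""
    parsed = string.Formatter().parse(template)
    # Each parsed item is (literal_text, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    return {field for _, field, _, _ in parsed if field}


def missing_columns(template, columns):
    """Return placeholder names that have no matching dataset column."""
    return template_variables(template) - set(columns)


user_inputs = (
    "## User Inputs\n### Reference\n{reference}\n\n### Prompt\n{prompt}\n\n"
    "## AI-generated Response\n{response}"
)
# Hypothetical dataset columns that lack the new reference field.
print(missing_columns(user_inputs, ["prompt", "response"]))
```

An empty result means every placeholder has a column with the same name; otherwise, either rename the column or supply a column mapping.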

Provide few-shot examples to improve quality

Few-shot examples can significantly improve the quality and consistency of evaluation responses by steering the judge model toward your chosen output formats and styles. We suggest starting with five to ten few-shot examples.

To incorporate few-shot examples:

Identify relevant examples: select examples that resemble the type of input data you will evaluate.
Include examples in the prompt: place the examples directly in the evaluation prompt, before the task or context.
Format examples: make sure the examples follow your chosen output style and format.

For example, you can provide few-shot examples for the coherence metric and add an instruction to use them, as follows:

# Instruction
You are an expert evaluator. Your task is to evaluate the quality of the responses generated by AI models.
We will provide you with the user input and an AI-generated response.
You should first read the user input carefully for analyzing the task, and then evaluate the quality of the responses based on the Criteria provided in the Evaluation section below.
You will assign the response a rating following the Rating Rubric and Evaluation Steps as shown in few shot examples. Give step-by-step explanations for your rating, and only choose ratings from the Rating Rubric.

# Evaluation
## Metric Definition
...

## Criteria
...

## Rating Rubric
...

## Few-shot Examples
Response: Purple monkeys jumped onto the submarine while Beethoven's Fifth Symphony played loudly and the chef cooked spaghetti with meatballs.
Explanation: The provided response is a single sentence lacking any discernible structure or connections between the ideas presented. There's no logical flow to assess, no organization, and the juxtaposition of elements (monkeys, submarine, symphony, spaghetti) creates jarring incoherence.
Score: 1

Response: Learning a new language can be a rewarding experience for children, opening doors to different cultures and expanding their understanding of the world. There are many resources available to help children learn languages, from online courses and apps to language exchange programs and immersion schools.
Explanation: The response presents two related ideas: the benefits of learning a new language for children and the resources available to aid in that process. However, there is no clear transition or connection between these two distinct points. While both sentences are relevant to the topic of language acquisition in children, the relationship between them could be made more explicit.
Score: 3

Response: Although the internet has revolutionized communication and information sharing, it has also created echo chambers where individuals are only exposed to opinions and beliefs that align with their own. This polarization can lead to increased hostility and misunderstanding between different groups, making it difficult to find common ground on important issues. Consequently, fostering media literacy and critical thinking skills is essential for navigating the vast and often biased landscape of online information. By teaching individuals to evaluate sources, identify biases, and consider diverse perspectives, we can empower them to break free from echo chambers and engage in meaningful dialogue with those who hold differing views.
Explanation: The response exhibits a clear and logical flow of ideas. The transition words 'although' and 'consequently' effectively signal the relationship between the internet's advantages, its drawbacks (echo chambers), and the proposed solution (media literacy). The text maintains cohesion through consistent focus on the central theme of online polarization and its remedies.
Score: 5

## Evaluation Steps
...

# User Inputs and AI-generated Response
## User Inputs
### Prompt
{prompt}

## AI-generated Response
{response}
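
If the few-shot examples are stored as structured data, the Few-shot Examples section of a template like the one above could be generated rather than written by hand. The sketch below is one possible way to do that; the `FewShotExample` class and `format_few_shot_section` function are hypothetical names, and the output simply follows the Response / Explanation / Score layout used in the template.

```python
from dataclasses import dataclass


@dataclass
class FewShotExample:
    response: str
    explanation: str
    score: int


def format_few_shot_section(examples):
    """Render examples in the Response / Explanation / Score layout."""
    blocks = [
        f"Response: {e.response}\nExplanation: {e.explanation}\nScore: {e.score}"
        for e in examples
    ]
    # Blank lines separate consecutive examples, as in the template above.
    return "## Few-shot Examples\n" + "\n\n".join(blocks)


examples = [
    FewShotExample(
        response="Purple monkeys jumped onto the submarine while "
        "Beethoven's Fifth Symphony played loudly and the chef cooked "
        "spaghetti with meatballs.",
        explanation="The provided response is a single sentence lacking "
        "any discernible structure or connections between the ideas presented.",
        score=1,
    ),
]
section = format_few_shot_section(examples)
print(section)
```

Generating the section from data keeps every example in the same format, which supports the formatting consistency that few-shot examples are meant to provide.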