
An Implementation of Semi-Automated Generated Extracted Prompts#271

Closed
Diana9513 wants to merge 2 commits into apache:main from Diana9513:sbq

Conversation


@Diana9513 Diana9513 commented Jun 10, 2025

I. Introduction
The current implementation uses the DeepSeek R1 32B model for inference.

A test set has been constructed, covering the following categories: [Traffic Accident Investigation Report, Short Story 1, Short Story 2, Legal Provision 1, Legal Provision 2, Chat Log 1, Industry Standard 1, Process Explanation 1].

II. Current Approaches to Complete the Task
(1) Corresponding to File No. 1
(In the absence of an existing template set) The model is guided through prompts to perform multi-step reasoning, and the reasoning results are summarized to form a prompt (a chain-of-thought model is not required, since this is prompt-guided reasoning rather than the model's own chain of thought).

The prompt includes an entity set, a relationship set, and specifies the generation direction.
This process also extracts the user's #graph-construction instruction.

The entire generation process calls GPT once, but the multi-step reasoning guidance results in a higher token count and longer generation time.
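
The single-call, multi-step guidance described above can be sketched roughly as follows; the step wording, the example inputs, and the `build_reasoning_prompt` helper are assumptions for illustration, not the PR's exact code.

```python
# Hypothetical sketch of the single GPT call that walks the model through
# multiple reasoning steps and asks it to summarize them into a prompt.
def build_reasoning_prompt(original_text: str, instruction: str) -> str:
    steps = [
        "1. List the entities mentioned in the text.",
        "2. List the relationships between those entities.",
        "3. List the attributes of each entity.",
        "4. Summarize steps 1-3 into a graph-construction prompt, "
        "enclosed in <>; restate the user requirement enclosed in ++.",
    ]
    return (
        f"Text:\n{original_text}\n\n"
        f"User requirement: {instruction}\n\n"
        "Follow these steps and show your work:\n" + "\n".join(steps)
    )

prompt = build_reasoning_prompt(
    "Alice sued Bob over a traffic accident.",  # invented sample text
    "character relationship graph",
)
```

Because the steps are packed into one message, only a single API call is needed, at the cost of the higher token count noted above.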

(2) Corresponding to File No. 2
(With an existing template set) The #graph-construction instruction and template set are encoded using TF-IDF vectors, and the most similar vector is matched via cosine similarity (BERT or GPT is not used here, primarily to reduce one API call). The most relevant task template recalled from the template set is integrated into the approach described in (1), and both are submitted to the model for prompt generation.
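
The template-recall step can be sketched with scikit-learn as below; the template strings are invented examples, not the PR's actual template set.

```python
# Minimal sketch of TF-IDF template matching: encode the templates and the
# user's instruction in one shared vocabulary, then pick the template with
# the highest cosine similarity. No LLM call is involved in this step.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

templates = [
    "Build a character relationship graph from a story.",
    "Build a legal provision graph from statute text.",
    "Build a process step graph from a procedure description.",
]

def match_template(instruction: str) -> str:
    vectorizer = TfidfVectorizer()
    # Fit on templates + query together so both share one vocabulary.
    matrix = vectorizer.fit_transform(templates + [instruction])
    sims = cosine_similarity(matrix[-1], matrix[:-1])  # query vs. templates
    return templates[sims.argmax()]

best = match_template("character relationship graph")
```

Note that the default `TfidfVectorizer` does no stemming, so recall quality depends on the templates sharing surface wording with typical instructions.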

III. Evaluation Method
Based on eight different long texts, each with two graph-construction requirements, 16 responses are generated.

In the Excel file:

Original Text: Represents the user's graph-construction text.

Instruction: Represents the user's brief requirement, such as "character relationship graph" (six characters in Chinese).

During inference, only the "original text + instruction" is used directly, without any additional supplementation or emphasis.

Analysis Process:
The LLM is invoked with the query "original text + instruction."
The model produces the full output (analysis text plus instruction), with the generated instruction enclosed in <> and the user requirement in ++, so that both can be extracted afterwards.

Generated Instruction (!!! This is the key reference point):
The instruction extracted from the content generated in step 3, obtained from the <> markers.
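
Given the <> and ++ marker convention above, the extraction can be sketched with two regular expressions; the sample output string here is invented.

```python
# Illustrative extraction of the generated instruction (<...>) and the
# user requirement (++...++) from the model's full output.
import re

output = (
    "Analysis: the text centers on two people involved in a dispute. "
    "<Extract all persons and their relationships; output a Neo4j graph.> "
    "++character relationship graph++"
)

# Non-greedy matches so only the marked spans are captured; re.S lets
# the markers span line breaks in longer outputs.
instruction = re.search(r"<(.*?)>", output, re.S).group(1)
requirement = re.search(r"\+\+(.*?)\+\+", output, re.S).group(1)
```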

Effect of Applying the Instruction:
The prompt obtained in step 4 is fed directly into the DeepSeek R1 32B model for inference, and the results are used to evaluate the quality of the model-generated prompt.
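
This evaluation loop can be sketched as below; the column names and the `call_model` argument are illustrative stand-ins (the PR itself uses an `API()` helper and Chinese column names such as '应用该指令效果').

```python
# Hypothetical sketch of the evaluation step: feed each extracted prompt
# back through the model and store the result per row.
import pandas as pd

def evaluate_prompts(df: pd.DataFrame, call_model) -> pd.DataFrame:
    for index, row in df.iterrows():
        result = call_model(row["generated_instruction"])
        df.loc[index, "application_effect"] = result
    # Per the review feedback below, write the Excel file once, after
    # the loop, rather than once per row:
    # df.to_excel(r"C:\Data_Test.xlsx", index=False)
    return df

demo = pd.DataFrame({"generated_instruction": ["prompt A", "prompt B"]})
out = evaluate_prompts(demo, lambda p: f"graph for: {p}")
```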

IV. Evaluation
First, the problem is manually broken down into specific steps that guide the model to process the text and produce intermediate results (lists of relationships and attributes). Summarizing these intermediate results makes complex problems tractable to analyze (a manual chain of thought).

Next, the generated prompt specifies the entities, relationships, and attributes in the original text and provides direction for graph construction (the form of the graph), thereby optimizing the user experience.

Then, from the perspective of the final output, if the user's requirement is analysis + graph-construction statements, the current approach works well. However, if the user requests less analysis and only the graph-construction statements, further modifications are needed.

Finally, the average cost per generated prompt is 2,532 tokens (positively correlated with the input text length).
The average original text length is 537 tokens, resulting in a consumption ratio of 1:4.71.

Comparison of the Two Approaches:
Using the large model alone to construct the prompt yields more descriptive language beyond the graph-construction statements, but it also retains a wider variety of entities and establishes more relationships.

In contrast, the approach of matching the most similar task template from the template library helps reduce descriptive language related to graph-construction statements.

@dosubot dosubot Bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jun 10, 2025
@Diana9513 Diana9513 changed the title from 提交工作代码和测试数据 (Submit working code and test data) to An Implementation of Semi-Automated Generated Extracted Prompts Jun 10, 2025
@Diana9513 Diana9513 force-pushed the sbq branch 2 times, most recently from 86098ac to dd98ea6 Compare June 10, 2025 14:08
@imbajin imbajin requested a review from Copilot June 10, 2025 14:09
Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements a semi-automated prompt generation tool that extracts user instructions and generates knowledge graph construction prompts for downstream processing with a large model.

  • Introduces an API function to interact with external services via HTTP requests.
  • Implements TF-IDF–based template matching to select the most similar prompt template.
  • Constructs multi-step prompt generation and Neo4j statement conversion based on user input.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

File Description
  • New_semi-automated generated extracted prompt2.py: Added modules for API calls, template matching, prompt extraction, and graph construction using TF-IDF and regex.
  • New_semi-automated generated extracted prompt1.py: Similar implementation to prompt2.py, with minor differences in the graph-construction prompt phrasing.

result = API(BuildGraph4User)
df.loc[index, '应用该指令效果'] = result

df.to_excel(r'C:\Data_Test.xlsx', index=False)  # No newline at end of file

Copilot AI Jun 10, 2025


Consider moving the Excel write operation outside of the for loop to reduce unnecessary I/O operations, especially if processing a large dataset.

Suggested change
- df.to_excel(r'C:\Data_Test.xlsx', index=False)
+ # Write the modified DataFrame to the Excel file after processing all rows
+ df.to_excel(r'C:\Data_Test.xlsx', index=False)

Copilot uses AI. Check for mistakes.
result = API(BuildGraph4User)
df.loc[index, '应用该指令效果'] = result

df.to_excel(r'C:\Data_Test.xlsx', index=False)  # No newline at end of file

Copilot AI Jun 10, 2025


Consider moving the Excel write operation outside of the for loop to reduce repetitive disk writes and improve performance.

Suggested change
- df.to_excel(r'C:\Data_Test.xlsx', index=False)
+ # Write the updated DataFrame to Excel after processing all rows
+ df.to_excel(r'C:\Data_Test.xlsx', index=False)

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def API(question):

Copilot AI Jun 10, 2025


The API function appears identical to the one in another file; consider refactoring it into a shared module to reduce duplication.

import re
import pandas as pd

def API(question):

Copilot AI Jun 10, 2025


The API function is duplicated across files; consider consolidating this functionality into a common module to enhance code maintainability.

@imbajin
Member

imbajin commented Jul 25, 2025

finished by #281

@imbajin imbajin closed this Jul 25, 2025

Labels

enhancement New feature or request llm size:L This PR changes 100-499 lines, ignoring generated files.
