LLM Detection Proxy: Reading the Source

Posted on November 23, 2024

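LLM Detection Proxy is part of a proof-of-concept (POC) project from Elastic on embedding security into LLM workflows. It detects malicious activity in LLM API traffic. Its components are an LLM API proxy, LLM conversation detection, and an Elastic connector. This post focuses on the LLM conversation detection.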

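Detection entry point

All of the detection logic in LLM Detection Proxy lives in the analyze_and_enrich_request function.

Arguments

messages[-1]: the last message in the LLM conversation, passed as prompt
response_content: the content returned by the LLM, passed as response_text
error_response: the error information returned by the LLM, if any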

    # Perform additional analysis and create the Elastic document
    additional_analysis = analyze_and_enrich_request(
        prompt=messages[-1],
        response_text=response_content,
        error_response=error_response,
    )

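Detection process

LLM Detection Proxy runs its detection with two open-source third-party tools for analyzing and protecting LLM interactions.

LLM Guard detection

The code first initializes two lists, input_scanners and output_scanners, which hold the scanners for the input and the output.

input_scanners list

Anonymize: anonymization scanner
Toxicity: toxic content scanner
TokenLimit: token limit scanner
PromptInjection: prompt injection scanner

output_scanners list

Deanonymize: de-anonymization scanner
NoRefusal: no-refusal scanner (flags responses in which the model refuses)
Relevance: relevance scanner
Sensitive: sensitive information scanner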

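    from llm_guard import scan_output, scan_prompt
    from llm_guard.input_scanners import Anonymize, PromptInjection, TokenLimit, Toxicity
    from llm_guard.output_scanners import Deanonymize, NoRefusal, Relevance, Sensitive
    from llm_guard.vault import Vault

    # The vault stores the original values so Deanonymize can restore them later
    vault = Vault()
    input_scanners = [Anonymize(vault), Toxicity(), TokenLimit(), PromptInjection()]
    output_scanners = [Deanonymize(vault), NoRefusal(), Relevance(), Sensitive()]

    # LLM Guard analysis
    sanitized_prompt, results_valid_prompt, results_score_prompt = scan_prompt(
        input_scanners, prompt["content"]
    )
    (
        sanitized_response_text,
        results_valid_response,
        results_score_response,
    ) = scan_output(output_scanners, sanitized_prompt, response_text)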

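scan_prompt

The scanners run in the order in which they are passed to scan_prompt. Given the prompt, the function returns three values:

sanitized_prompt: the prompt string after all scanners have been applied. Scanners may modify the prompt; Anonymize, for example, replaces detected sensitive entities with placeholders.
results_valid_prompt: a dict mapping each scanner name to a boolean, where False means a risk was detected
results_score_prompt: a dict mapping each scanner name to a float risk score, where 0 means no risk and 1 means high risk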
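To make the three return values concrete, here is a minimal sketch of a sequential scanner pipeline with the same shape of output. It does not use LLM Guard; the toy scanners mask_email and length_limit and the helper run_scanners are hypothetical names invented for illustration, and the real scanners are far more sophisticated.

```python
from typing import Callable, Dict, List, Tuple

# A toy scanner takes a prompt and returns
# (possibly modified prompt, is_valid, risk_score).
ScanResult = Tuple[str, bool, float]
Scanner = Callable[[str], ScanResult]


def mask_email(prompt: str) -> ScanResult:
    """Toy stand-in for an anonymizing scanner: rewrites the prompt, never fails it."""
    words = ["[EMAIL]" if "@" in word else word for word in prompt.split()]
    return " ".join(words), True, 0.0


def length_limit(prompt: str, max_words: int = 8) -> ScanResult:
    """Toy stand-in for a token-limit scanner: flags prompts that are too long."""
    too_long = len(prompt.split()) > max_words
    return prompt, not too_long, 1.0 if too_long else 0.0


def run_scanners(
    scanners: List[Scanner], prompt: str
) -> Tuple[str, Dict[str, bool], Dict[str, float]]:
    """Run scanners in order; each one sees the output of the previous one."""
    valid: Dict[str, bool] = {}
    scores: Dict[str, float] = {}
    for scanner in scanners:
        prompt, is_valid, score = scanner(prompt)
        valid[scanner.__name__] = is_valid
        scores[scanner.__name__] = score
    return prompt, valid, scores


sanitized, results_valid, results_score = run_scanners(
    [mask_email, length_limit], "send the report to alice@example.com today"
)
```

Because each scanner receives the prompt as rewritten by the one before it, the order of the list matters: a scanner placed after an anonymizer only ever sees the masked text.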

scan_output

scan_output takes response_text together with sanitized_prompt (which may have been modified) and likewise returns three values, with the same meaning as those of scan_prompt.

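LangKit detection

Passing the prompt to LangKit's injections module returns a risk score. The score is the maximum similarity between the prompt and a set of known jailbreak attempts and harmful behaviors. Each similarity is the cosine similarity between the embedding generated from the prompt and the embedding of a known example. The embeddings are generated with the Hugging Face model sentence-transformers/all-MiniLM-L6-v2.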

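    from langkit import extract, injections

    # LangKit for additional analysis
    schema = injections.init()
    langkit_result = extract({"prompt": prompt["content"]}, schema=schema)

    # Fall back to 0 (no risk) if the metric is missing from the result
    prompt_injection_score = langkit_result.get("prompt.injection", 0)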
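The "maximum cosine similarity" idea can be sketched in a few lines of plain Python. This is only an illustration of the scoring rule described in this post, not LangKit's implementation: the three-dimensional vectors are made-up stand-ins for real sentence embeddings, and the names cosine_similarity and injection_score are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def injection_score(
    prompt_embedding: Sequence[float],
    known_attack_embeddings: List[Sequence[float]],
) -> float:
    """Score a prompt by its closest match among known attack embeddings."""
    return max(
        cosine_similarity(prompt_embedding, known)
        for known in known_attack_embeddings
    )


# Made-up toy embeddings; real ones would come from an embedding model.
known_attacks = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
score = injection_score([1.0, 1.0, 0.0], known_attacks)
```

Taking the maximum rather than the average means a prompt only needs to resemble one known attack closely to receive a high score.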

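Detection results

💡 In these scores, 0 means no risk and 1 means high risk.

analysis: the analysis results
- llm_guard_prompt_scores: dict from the LLM Guard scan of the request prompt, mapping scanner names to float risk scores
- llm_guard_response_scores: dict from the LLM Guard scan of the response content, mapping scanner names to float risk scores
- langkit_score: the prompt injection risk score from the LangKit scan
malicious: whether the request is considered malicious
identified_threats: list of the names of the scanners that flagged a threat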

    # Combine results and enrich document
    # llm_guard scores map scanner names to float values of risk scores,
    # where 0 is no risk, and 1 is high risk.
    # langkit_score is a float value of the risk score for prompt injection
    # based on known threats.
    enriched_document = {
        "analysis": {
            "llm_guard_prompt_scores": results_score_prompt,  # prompt scan: {"scanner_name": score}
            "llm_guard_response_scores": results_score_response,  # response scan: {"scanner_name": score}
            "langkit_score": prompt_injection_score,  # risk score from the LangKit scan
        },
        "malicious": any(identified_threats),  # whether malicious, derived from identified_threats
        "identified_threats": identified_threats,  # list of threats found by the scans
    }
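The excerpt above does not show how identified_threats is built. One plausible way, sketched here purely as an assumption, is to collect every scanner whose validity flag is False and add an entry when the LangKit score crosses a cut-off. The helper collect_threats, the label "LangKit Injection", the 0.5 threshold, and the sample values are all hypothetical and not taken from the project.

```python
from typing import Dict, List

LANGKIT_THRESHOLD = 0.5  # hypothetical cut-off, not taken from the project


def collect_threats(
    valid_prompt: Dict[str, bool],
    valid_response: Dict[str, bool],
    langkit_score: float,
    threshold: float = LANGKIT_THRESHOLD,
) -> List[str]:
    """List the scanners that flagged a risk (validity flag is False)."""
    threats = [name for name, ok in valid_prompt.items() if not ok]
    threats += [name for name, ok in valid_response.items() if not ok]
    if langkit_score > threshold:
        threats.append("LangKit Injection")  # hypothetical label
    return threats


# Made-up sample results in the shape returned by the scans.
identified_threats = collect_threats(
    {"Toxicity": True, "PromptInjection": False},
    {"Sensitive": True},
    langkit_score=0.8,
)
malicious = any(identified_threats)
```

With this shape, malicious is True whenever at least one scanner or the LangKit check flags the request, which matches the any(identified_threats) expression in the document being built.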

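Conclusion

The walkthrough of the LLM Detection Proxy source stops here. In practice, the proxy simply relies on two third-party tools, LLM-Guard and LangKit, to detect malicious activity.

Elastic also shares several other tools for analyzing and protecting LLM interactions:

Rebuff uses machine learning to identify and mitigate attempts at social engineering, phishing, and other malicious activity through LLM interactions. Example usage involves passing the request content through Rebuff's analysis engine and tagging the request with a "malicious" boolean field based on the result.

Vigil-LLM focuses on real-time monitoring and alerting for suspicious LLM requests. Integrated into the proxy layer, it can flag potential security issues immediately and enrich the request data with vigilance scores.

Open-Prompt Injection provides methods and tools for detecting prompt injection attacks, allowing request data to be enriched with specific indicators of compromise related to prompt injection techniques.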

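LLM-Guard combines rule-based checks (such as TokenLimit) with model-based scanners (such as Toxicity and PromptInjection), while LangKit's injection check relies on embedding similarity. Combining rules with machine learning is also the mainstream approach to detecting malicious activity.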

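However, this is only one part of Elastic's LLM detection practice. Beyond rule engines and machine learning, detection usually also involves correlation models.

The other part, writing ES|QL detection rules that can later be used for response, is covered in "Embedding Security in LLM Workflows: Elastic's Proactive Approach".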

Appendix

Copyright

This article was originally published on not only security. Please retain the original source when reproducing it.