Other language tasks

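In this section we use topic classification and in-context learning as examples to demonstrate how to load a language model and run inference with it. For other language tasks, similar tutorials and code can be found by searching the Hugging Face platform.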

import torch.nn.functional as F
from transformers import (
    BertTokenizer,
    GPT2LMHeadModel,
    TextGenerationPipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForSeq2SeqLM,
    pipeline,
)

1 Topic classification task

Use the transformers pipeline to implement a language task quickly.

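# Path to the model (look up the corresponding model on the Hugging Face platform)
model_path = 'model/roberta-base-finetuned-chinanews-chinese'

# Load the tokenizer and the model with the transformers toolkit
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Run the task through a pipeline
# ('sentiment-analysis' is an alias of 'text-classification'; the explicit name fits topic classification better)
text = '欢迎参加工作坊!'  # "Welcome to the workshop!"
text_classification = pipeline('text-classification', model=model, tokenizer=tokenizer)
res = text_classification(text)[0]
print("="*20, "Topic classification for a single sentence", "="*20)
print(f"\nInput: {text}\nPrediction: {res['label']}, Score: {res['score']:.3f}")


# A pipeline can also process a batch of sentences
text_lst = [
    '2023年心理语言学会在广州召开',  # "The 2023 psycholinguistics conference is held in Guangzhou"
    '湖人有意签保罗补强,联手詹姆斯追逐总冠军',  # "Lakers interested in signing Paul to team up with James and chase the title"
]
res_lst = text_classification(text_lst)
print("\n\n")
print("="*20, "Batch topic classification for multiple sentences", "="*20)
for text, res in zip(text_lst, res_lst):
    print(f"\nInput: {text}\nPrediction: {res['label']}, Score: {res['score']:.3f}")

==================== Topic classification for a single sentence ====================

Input: 欢迎参加工作坊!
Prediction: culture, Score: 0.723



==================== Batch topic classification for multiple sentences ====================

Input: 2023年心理语言学会在广州召开
Prediction: culture, Score: 0.969

Input: 湖人有意签保罗补强,联手詹姆斯追逐总冠军
Prediction: sports, Score: 1.000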
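
To make the `label` and `score` fields less of a black box: for a single-label classifier, the score reported by a text-classification pipeline is typically the softmax probability of the highest-scoring class. The following is a minimal sketch of that last step in plain Python. The label names and logit values are made up for illustration and do not come from the model used in this section.

```python
import math

def softmax(logits):
    """Convert raw class logits into probabilities that sum to 1."""
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical labels and logits, for illustration only
labels = ["culture", "sports", "finance"]
logits = [1.2, 3.4, -0.5]

probs = softmax(logits)
best = max(range(len(labels)), key=lambda i: probs[i])  # index of the top class
result = {"label": labels[best], "score": probs[best]}
print(f"Prediction: {result['label']}, Score: {result['score']:.3f}")
```

A score close to 1.000, as in the sports example, therefore means that nearly all of the probability mass falls on one class.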

2 In-context learning

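Given a task description and examples in the context, a general-purpose text generation model can pick up a language task quickly from that context. Here we do not use pipeline and instead call the model's methods directly.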

# Path to the model (look up the corresponding model on the Hugging Face platform)
model_path = "model/flan-t5-large"

# Load the tokenizer and the model with the transformers toolkit
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)


print("\n\n")
print("="*20, "In-context learning: translation", "="*20)
text = "translate English to German: How old are you?"

# Tokenize the input text and convert it to tensors the model can process
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Call the model's generate method
# (no max_new_tokens is set, so max_length defaults to 20; see the warning in the output)
outputs = model.generate(input_ids)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {text}\nOutput: {decoded_output}")



print("\n\n")
print("="*20, "In-context learning: topic-conditioned text generation", "="*20)
text = '''Generate sentences with the topic : 
sports => Lionel Messi and MLS club Inter Miami are discussing possible signing
entertainment => 
'''

# Tokenize the input text and convert it to tensors the model can process
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Call the model's generate method
outputs = model.generate(input_ids)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {text}\nOutput: {decoded_output}")

==================== In-context learning: translation ====================
/home/zhang/anaconda3/envs/ngram/lib/python3.7/site-packages/transformers/generation/utils.py:1278: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 20 (`generation_config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  UserWarning,
Input: translate English to German: How old are you?
Output: Wie alte sind Sie?



==================== In-context learning: topic-conditioned text generation ====================
Input: Generate sentences with the topic : 
sports => Lionel Messi and MLS club Inter Miami are discussing possible signing
entertainment => 

Output: a new tv series starring adrian sandler is 
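
The few-shot prompt in this section is just a string: one instruction line, one `label => text` line per example, and a final line that ends in `=> ` for the model to complete. As a rough sketch, it can be assembled programmatically so that examples are easy to add or swap. The helper name `build_few_shot_prompt` is our own and is not part of transformers.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an in-context learning prompt.

    instruction: task description placed on the first line
    examples:    list of (label, text) pairs shown as demonstrations
    query:       label for which the model should produce the text
    """
    lines = [instruction]
    for label, text in examples:
        lines.append(f"{label} => {text}")
    lines.append(f"{query} => ")  # left open for the model to complete
    return "\n".join(lines) + "\n"

# Reproduces the prompt used in this section
prompt = build_few_shot_prompt(
    "Generate sentences with the topic : ",
    [("sports", "Lionel Messi and MLS club Inter Miami are discussing possible signing")],
    "entertainment",
)
print(prompt)
```

The resulting string can be passed to the tokenizer exactly like `text` in the cell above.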

3 Text generation hyperparameters

In this section we analyze how the temperature parameter, the search strategy, and the top-p parameter affect the generated text.
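
Before running the model, it may help to see what temperature does to a next-token distribution. The sketch below is plain Python on made-up logits, not the library's implementation: it divides the logits by the temperature before the softmax, which is the usual definition. Greedy search always picks the highest-scoring token, while sampling draws from the rescaled distribution.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by the temperature (must be > 0)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token logits, for illustration only
logits = [2.0, 1.0, 0.5, 0.1]

# Low temperature sharpens the distribution, high temperature flattens it
for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: {[round(p, 3) for p in probs]}")

# Greedy search: always take the most likely token
greedy_choice = max(range(len(logits)), key=lambda i: logits[i])

# Sampling: draw token indices according to the probabilities
rng = random.Random(0)  # fixed seed so the sketch is repeatable
sampled = rng.choices(
    range(len(logits)),
    weights=softmax_with_temperature(logits, 1.0),
    k=5,
)
print("greedy:", greedy_choice, "sampled:", sampled)
```

At a temperature of 0.1 almost all of the probability goes to the top token, so sampling behaves much like greedy search. At 1.0 the lower-ranked tokens are drawn noticeably more often.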

# Path to the model (look up the corresponding model on the Hugging Face platform)
model_path = "model/flan-t5-large"

# Load the tokenizer and the model with the transformers toolkit
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = 'Welcome to '

# Tokenize the input text and convert it to tensors the model can process
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Other adjustable parameters such as top_k and top_p can be passed directly to .generate()
# ref: https://huggingface.co/blog/how-to-generate
print(f'\nInput: {text}\n')
print("="*20, "Greedy search", "="*20)
for i in range(5):
    outputs = model.generate(input_ids, max_length=10)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Iter {i}: {decoded_output}")

print("="*20, "Sampling, temperature=0.1", "="*20)
for i in range(5):
    outputs = model.generate(input_ids, do_sample=True, temperature=0.1, max_length=10)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Iter {i}: {decoded_output}")


print("="*20, "Sampling, temperature=1.0", "="*20)
for i in range(5):
    outputs = model.generate(input_ids, do_sample=True, temperature=1.0, max_length=10)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Iter {i}: {decoded_output}")

Input: Welcome to 

==================== Greedy search ====================
Iter 0: Welcome to the e-commerce world!
Iter 1: Welcome to the e-commerce world!
Iter 2: Welcome to the e-commerce world!
Iter 3: Welcome to the e-commerce world!
Iter 4: Welcome to the e-commerce world!
==================== Sampling, temperature=0.1 ====================
Iter 0: Welcome to the iStockphoto
Iter 1: Welcome to the official website of the 
Iter 2: Welcome to the world of e-commerce
Iter 3: Welcome to the world of e-commerce
Iter 4: Welcome to the official website of the 
==================== Sampling, temperature=1.0 ====================
Iter 0: Welcome to World of Aliens! a
Iter 1: Welcome to the new website.  All
Iter 2: Welcome to the Wikimedia Foundation! This
Iter 3: Hi, I am Jason, the owners of
Iter 4: Welcome to the Official Website of eW
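
The runs above vary only the search strategy and the temperature. For the top-p parameter mentioned in this section, the idea can be sketched in plain Python on a made-up distribution: keep the smallest set of highest-probability tokens whose cumulative probability reaches `top_p`, discard the rest, and renormalize. This is a simplified illustration under that assumption, not the code transformers runs internally. In practice one would pass `top_p` together with `do_sample=True` to `.generate()`.

```python
def top_p_filter(probs, top_p):
    """Keep the smallest top-ranked set of tokens whose mass reaches top_p."""
    # Token indices sorted from most to least probable
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    # Renormalize the kept tokens; everything else gets probability 0
    total = sum(probs[i] for i in kept)
    filtered = [0.0] * len(probs)
    for i in kept:
        filtered[i] = probs[i] / total
    return filtered

# Hypothetical next-token probabilities, for illustration only
probs = [0.5, 0.3, 0.15, 0.05]
filtered = top_p_filter(probs, top_p=0.9)
print([round(p, 3) for p in filtered])
```

With `top_p=0.9` the least likely token is removed from this toy distribution, so sampling can no longer pick it. Smaller values of `top_p` cut the tail more aggressively.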