Other Language Tasks
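In this subsection, we use a topic classification task and in-context learning as examples to demonstrate how to load a language model and run inference with it. For other language tasks, similar tutorials and code can be found by searching the Hugging Face platform.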
import torch.nn.functional as F
from transformers import (
    BertTokenizer,
    GPT2LMHeadModel,
    TextGenerationPipeline,
    AutoTokenizer,
    AutoModelForSequenceClassification,
    AutoModelForSeq2SeqLM,
    pipeline,
)
1 Topic Classification Task
Use the transformers pipeline API to implement a language task quickly.
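# Locate the corresponding model path on the Hugging Face platform
model_path = 'model/roberta-base-finetuned-chinanews-chinese'

# Load the model with the transformers toolkit
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSequenceClassification.from_pretrained(model_path)

# Use pipeline to run the language task quickly.
# 'text-classification' is the canonical task name; 'sentiment-analysis' is an alias for it.
text = '欢迎参加工作坊!'  # "Welcome to the workshop!"
text_classification = pipeline('text-classification', model=model, tokenizer=tokenizer)
res = text_classification(text)[0]
print("="*20, "Topic classification for a single sentence", "="*20)
print(f"\nInput: {text}\nPrediction: {res['label']}, Score: {res['score']:.3f}")

# pipeline can also process a batch of sentences.
# Sentence 1: "The 2023 psycholinguistics conference is held in Guangzhou"
# Sentence 2: "The Lakers want to sign Paul to team up with James and chase the championship"
text_lst = ['2023年心理语言学会在广州召开', '湖人有意签保罗补强,联手詹姆斯追逐总冠军']
res_lst = text_classification(text_lst)
print("\n\n")
print("="*20, "Batch topic classification for multiple sentences", "="*20)
for text, res in zip(text_lst, res_lst):
    print(f"\nInput: {text}\nPrediction: {res['label']}, Score: {res['score']:.3f}")

==================== Topic classification for a single sentence ====================
Input: 欢迎参加工作坊!
Prediction: culture, Score: 0.723
==================== Batch topic classification for multiple sentences ====================
Input: 2023年心理语言学会在广州召开
Prediction: culture, Score: 0.969
Input: 湖人有意签保罗补强,联手詹姆斯追逐总冠军
Prediction: sports, Score: 1.000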
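The Score printed above is a probability over the model's topic labels. As a rough sketch of where such a number comes from, the snippet below applies a softmax to a vector of raw logits and picks the most probable label. The label list and logit values are made up for illustration and are not taken from the model, and the sketch assumes the pipeline uses a softmax, which is its usual behavior for single-label classifiers.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(logits, labels):
    """Return the most probable label and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": labels[best], "score": probs[best]}


# Hypothetical labels and logits, for illustration only
labels = ["culture", "sports", "finance"]
logits = [2.0, 0.5, -1.0]

result = classify(logits, labels)
print(f"Prediction: {result['label']}, Score: {result['score']:.3f}")
```

A score close to 1.000, as in the sports example, means the winning logit is far larger than all the others.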
2 In-Context Learning
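Given a task description and examples in the context, a general-purpose text generation model can quickly pick up a language task from that context. Here we do not use pipeline; instead, we call the model's methods directly.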
# Locate the corresponding model path on the Hugging Face platform
model_path = "model/flan-t5-large"

# Load the model with the transformers toolkit
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

print("\n\n")
print("="*20, "Text translation via in-context learning", "="*20)
text = "translate English to German: How old are you?"
# Tokenize the input text and convert it to tensors the model can process
input_ids = tokenizer(text, return_tensors="pt").input_ids
# Call the model's generate method. Neither max_length nor max_new_tokens is set,
# so generation falls back to the default max_length of 20 (see the warning in the output).
outputs = model.generate(input_ids)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {text}\nOutput: {decoded_output}")

print("\n\n")
print("="*20, "Topic-conditioned text generation via in-context learning", "="*20)
text = '''Generate sentences with the topic :
sports => Lionel Messi and MLS club Inter Miami are discussing possible signing
entertainment => 
'''
# Tokenize the input text and convert it to tensors the model can process
input_ids = tokenizer(text, return_tensors="pt").input_ids
# Call the model's generate method
outputs = model.generate(input_ids)
decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Input: {text}\nOutput: {decoded_output}")

==================== Text translation via in-context learning ====================
/home/zhang/anaconda3/envs/ngram/lib/python3.7/site-packages/transformers/generation/utils.py:1278: UserWarning: Neither `max_length` nor `max_new_tokens` has been set, `max_length` will default to 20 (`generation_config.max_length`). Controlling `max_length` via the config is deprecated and `max_length` will be removed from the config in v5 of Transformers -- we recommend using `max_new_tokens` to control the maximum length of the generation.
  UserWarning,
Input: translate English to German: How old are you?
Output: Wie alte sind Sie?
==================== Topic-conditioned text generation via in-context learning ====================
Input: Generate sentences with the topic :
sports => Lionel Messi and MLS club Inter Miami are discussing possible signing
entertainment =>
Output: a new tv series starring adrian sandler is
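The topic-generation prompt above follows a simple pattern: an instruction, one or more solved "label => text" examples, and a final label left open for the model to complete. A minimal sketch of assembling such a prompt programmatically is given below. The helper name build_few_shot_prompt and its arguments are hypothetical and are not part of transformers.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an in-context learning prompt.

    instruction: task description placed on the first line
    examples:    list of (label, text) pairs shown as solved demonstrations
    query:       label left open for the model to complete
    """
    lines = [instruction]
    for label, text in examples:
        lines.append(f"{label} => {text}")
    lines.append(f"{query} =>")  # the model continues from here
    return "\n".join(lines) + "\n"


prompt = build_few_shot_prompt(
    "Generate sentences with the topic :",
    [("sports", "Lionel Messi and MLS club Inter Miami are discussing possible signing")],
    "entertainment",
)
print(prompt)
```

The resulting string can be passed to the tokenizer in place of the hand-written text, which makes it easy to vary the number of demonstrations.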
3 Text Generation Hyperparameters
In this subsection, we analyze how the temperature parameter, the search strategy, and the top-p parameter affect the results of text generation.
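As a sketch of what the temperature parameter does, the snippet below divides a set of next-token logits by the temperature before applying the softmax. The three logit values are invented for illustration; a real model produces one logit per vocabulary entry.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, then normalize them into probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate next tokens
logits = [2.0, 1.0, 0.5]

for t in (0.1, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(f"temperature={t}: " + ", ".join(f"{p:.3f}" for p in probs))
```

For these logits, a low temperature concentrates almost all probability on the top token, which pushes sampling toward the greedy choice, while a higher temperature flattens the distribution and allows more varied output.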
# Locate the corresponding model path on the Hugging Face platform
model_path = "model/flan-t5-large"

# Load the model with the transformers toolkit
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForSeq2SeqLM.from_pretrained(model_path)

text = 'Welcome to '
# Tokenize the input text and convert it to tensors the model can process
input_ids = tokenizer(text, return_tensors="pt").input_ids

# Other tunable parameters include top_k and top_p; they can be passed directly to .generate()
# ref: https://huggingface.co/blog/how-to-generate
print(f'\nInput: {text}\n')

print("="*20, "Greedy search", "="*20)
for i in range(5):
    outputs = model.generate(input_ids, max_length=10)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Iter {i}: {decoded_output}")

print("="*20, "Random sampling, temperature=0.1", "="*20)
for i in range(5):
    outputs = model.generate(input_ids, do_sample=True, temperature=0.1, max_length=10)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Iter {i}: {decoded_output}")

print("="*20, "Random sampling, temperature=1.0", "="*20)
for i in range(5):
    outputs = model.generate(input_ids, do_sample=True, temperature=1.0, max_length=10)
    decoded_output = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(f"Iter {i}: {decoded_output}")

Input: Welcome to
==================== Greedy search ====================
Iter 0: Welcome to the e-commerce world!
Iter 1: Welcome to the e-commerce world!
Iter 2: Welcome to the e-commerce world!
Iter 3: Welcome to the e-commerce world!
Iter 4: Welcome to the e-commerce world!
==================== Random sampling, temperature=0.1 ====================
Iter 0: Welcome to the iStockphoto
Iter 1: Welcome to the official website of the
Iter 2: Welcome to the world of e-commerce
Iter 3: Welcome to the world of e-commerce
Iter 4: Welcome to the official website of the
==================== Random sampling, temperature=1.0 ====================
Iter 0: Welcome to World of Aliens! a
Iter 1: Welcome to the new website. All
Iter 2: Welcome to the Wikimedia Foundation! This
Iter 3: Hi, I am Jason, the owners of
Iter 4: Welcome to the Official Website of eW
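The runs above vary only the search strategy and the temperature. For top-p (nucleus) sampling, the rough idea is to keep the smallest set of highest-probability tokens whose cumulative probability reaches p, renormalize, and sample from that set. The sketch below illustrates this on a made-up four-token distribution. It is a simplified illustration rather than the transformers implementation, and the token names and probabilities are assumptions.

```python
import random


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches p, then renormalize the kept probabilities to sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}


def sample(probs, rng):
    """Draw one token according to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Hypothetical next-token distribution, for illustration only
probs = {"the": 0.5, "our": 0.3, "World": 0.15, "Hi": 0.05}

nucleus = top_p_filter(probs, p=0.7)
print(nucleus)

rng = random.Random(0)  # fixed seed so the draws are repeatable
print([sample(nucleus, rng) for _ in range(5)])
```

In transformers, the analogous setting is the top_p argument to .generate(), used together with do_sample=True.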