PythonTM

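1 Text Mining Overview

1.1 What is text mining
1.2 Basic workflow and tasks of text mining
1.3 Basic approach to text mining
1.4 Considerations when converting a raw corpus into data
1.5 The right way to approach text mining in Python

2 Sharpening the Axe: Tool Preparation

2.1 Common Python IDEs
2.2 Installing and configuring Anaconda
2.3 Basic Jupyter Notebook operations
2.4 Installing the NLTK package
2.5 Preparing the corpus
2.5.1 What is a corpus
2.5.2 Common corpus formats
2.5.3 Preparing the Condor Heroes (《射雕》) corpus
2.6 Practice: preparing tools and materials

3 Word Segmentation

3.1 Introduction to segmentation principles
3.2 Basic usage of jieba
3.3 Modifying the dictionary
3.3.1 Adding and deleting words dynamically
3.3.2 Using a custom dictionary
3.3.3 Using Sogou cell lexicons
3.4 Removing stopwords
3.4.1 Common kinds of stopwords
3.4.2 Removing stopwords after segmentation
3.4.3 Removing stopwords with the extract_tags function
3.5 Part-of-speech tagging
3.6 Segmentation with NLTK
3.7 Practice: segmenting Condor Heroes

4 Word Clouds

4.1 Word frequency statistics
4.1.1 Counting with Pandas
4.1.2 Counting with NLTK
4.2 Word cloud overview
4.3 Installing the wordcloud package
4.4 Drawing word clouds
4.4.1 Basic WordCloud syntax
4.4.2 Segmenting raw text directly and drawing
4.4.3 Drawing from token frequencies
4.5 Beautifying word clouds
4.5.1 Setting a background image
4.5.2 Taking the color scheme from an image
4.5.3 Assigning colors to word groups
4.6 Practice: improving the Condor Heroes word cloud

5 Vectorizing Document Information

5.1 Bag-of-words model
5.2 Bag-of-words with gensim
5.3 Generating the document-term matrix
5.3.1 With Pandas
5.3.2 With sklearn
5.4 From bag-of-words to N-grams
5.5 Distributed representation of document information
5.5.1 What is a distributed representation
5.5.2 Co-occurrence matrix
5.5.3 The NNLM model
5.5.4 The CBOW model
5.6 Practice: generating word vectors

6 Keyword Extraction

6.1 Basic approach to keyword extraction
6.2 The TF-IDF algorithm
6.3 Implementing TF-IDF
6.3.1 jieba
6.3.2 sklearn
6.3.3 gensim
6.4 The TextRank algorithm
6.5 Practice exercises

7 Extracting Document Topics

7.1 Basic concepts of topic models
7.2 sklearn implementation
7.3 gensim implementation
7.4 Visualizing the results
7.5 Practice exercises

8 Document Similarity

8.1 Basic concepts
8.2 Token similarity: word2vec
8.3 Document similarity
8.3.1 Based on the bag-of-words model
8.3.2 doc2vec
8.4 Document clustering
8.5 Practice exercises

9 Document Classification

9.1 Overview of document classification methods
9.2 The naive Bayes algorithm
9.3 sklearn implementation
9.4 NLTK implementation
9.5 Practice assignment

10 Sentiment Analysis

10.1 Overview of sentiment analysis
10.2 Analysis based on the bag-of-words model
10.3 Analysis based on distributed representations
10.4 Practice assignment

11 Automatic Document Summarization

11.1 Basic principles of automatic summarization
11.2 Evaluating automatic summarization
11.3 Python implementation of automatic summarization
11.4 Practice assignment

12 Automatic Writing

12.1 Basic principles of automatic writing
12.1.1 Application scenarios
12.1.2 Basic principles of RNNs
12.1.3 Basic principles of LSTMs
12.2 English writing with an LSTM
12.2.1 Text preprocessing
12.2.2 Building training and test sets
12.2.3 Building the LSTM model
12.2.4 Predicting text
12.3 Chinese automatic writing with LSTM combined with word2vec
12.3.1 Text preprocessing
12.3.2 Building training and test sets
12.3.3 Building the LSTM model
12.3.4 Predicting text
12.4 Practice assignment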

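Text Mining Overview

What is text mining

Basic workflow and tasks of text mining

Basic approach to text mining

Considerations when converting a raw corpus into data

The right way to approach text mining in Python

Sharpening the Axe: Tool Preparation

Common Python IDEs

Installing and configuring Anaconda

Basic Jupyter Notebook operations

Installing the NLTK package

What is NLTK

A complete natural language processing framework

Ships with its own corpora and part-of-speech tag sets
Ships with classification, tokenization, and other functions
Backed by a strong community
The framework was not designed with Chinese in mind

Main NLTK modules

How to install NLTK

Already installed by default in Anaconda

pip install nltk
nltk.download()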

In [ ]:

import nltk
nltk.download()

In [ ]:

nltk.__version__

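Alternatives to NLTK

NLTK is somewhat complex to use; beginners can also use one of the simplified wrappers built on top of NLTK

TextBlob

Preparing the corpus

What is a corpus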

In [ ]:

# Brown corpus example
from nltk.corpus import brown
brown.categories()

In [ ]:

len(brown.sents())

In [ ]:

len(brown.words())

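Common corpus formats

External files

Unless documents are scraped from the web and processed on the fly, raw documents tend to be large, so they are usually first saved externally as one or more text files and then read into the program

list

A flexible, loose structure that is convenient for processing the raw corpus; members can be added or removed at any time

[
'大魚吃小魚也吃蝦米，小魚吃蝦米。',
'我幫你，你也幫我。'
]

list of list

The usual form of a corpus after word segmentation: each document becomes a list of tokens, and these lists are in turn members of the document list

[
['大魚', '吃', '小魚', '也', '吃', '蝦米', '，', '小魚', '吃', '蝦米', '。'],
['我', '幫', '你', '，', '你', '也', '幫', '我', '。']
]

DataFrame

A common format for downstream analysis with the bag-of-words model: rows/columns represent the corpus index, and the corresponding columns/rows represent tokens or document attributes worth recording, such as author, original hyperlink, or publication date
When tokens are crossed with documents, each cell records the frequency of the token in the document, or a corresponding probability/score
    Doc2Term matrix
    Term2Doc matrix
Can be used together with the external files/list that hold the raw corpus

A DataFrame can also be built for a single document, with each row/column representing a sentence/paragraph/chapter.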
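The in-memory formats described above can be chained in a few lines. What follows is only a minimal sketch: the sample sentences are made up, and the `tokenize` helper is a hypothetical whitespace splitter standing in for a real segmenter such as jieba. It goes from a list of documents to a list of lists, and then to a document-term count table held in plain Python structures.

```python
from collections import Counter

# Hypothetical documents whose tokens are already separated by spaces.
raw_docs = ["big fish eat small fish", "small fish eat shrimp"]

def tokenize(text):
    """Stand-in for a real segmenter; simply splits on whitespace."""
    return text.split()

# list -> list of list
token_docs = [tokenize(doc) for doc in raw_docs]

# list of list -> one Counter (term -> count) per document
doc_term = [Counter(tokens) for tokens in token_docs]

# Dense document-term matrix with a fixed, sorted column order
vocab = sorted({term for tokens in token_docs for term in tokens})
matrix = [[counts[term] for term in vocab] for counts in doc_term]

print(vocab)
for row in matrix:
    print(row)
```

With pandas available, `pd.DataFrame(matrix, columns=vocab)` would give the Doc2Term layout, and its transpose the Term2Doc layout.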

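Preparing the Condor Heroes (《射雕》) corpus

Provides a general-purpose Pandas-based workflow for students who are not yet fluent in Python.

Reading into a DataFrame

In [ ]:

import pandas as pd

# read_table fails under some environment configurations, so read_csv is used instead.
# 'aaa' is a separator that does not occur in the text, so each line is read as one field;
# a multi-character separator needs the Python parsing engine.
raw = pd.read_csv("金庸-射雕英雄傳txt精校版.txt",
                  names = ['txt'], sep = 'aaa', encoding = "GBK", engine = 'python')
print(len(raw))
raw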

Adding chapter markers

In [ ]:

# Helper variables for chapter detection
def m_head(tmpstr):
    return tmpstr[:1]

def m_mid(tmpstr):
    return tmpstr.find("回 ")

# Note: raw.txt below refers to the txt column of the raw DataFrame; students unfamiliar with pandas should review the basics
raw['head'] = raw.txt.apply(m_head)
raw['mid'] = raw.txt.apply(m_mid)
raw['len'] = raw.txt.apply(len)
# raw['chap'] = 0
raw.head(50)

In [ ]:

# Chapter detection
chapnum = 0
for i in range(len(raw)):
    if raw['head'][i] == "第" and raw['mid'][i] > 0 and raw['len'][i] < 30:
        chapnum += 1
    if chapnum >= 40 and raw['txt'][i] == "附錄一：成吉思汗家族":
        chapnum = 0
    raw.loc[i, 'chap'] = chapnum

# Delete the temporary variables; this is required, otherwise the later sum call fails
del raw['head']
del raw['mid']
del raw['len']
raw.head(50)

Extracting the required chapters

In [ ]:

raw[raw.chap == 1].head()

In [ ]:

from matplotlib import pyplot as plt
%matplotlib inline
raw.txt.apply(len).plot.box() # box plot of line lengths

In [ ]:

rawgrp = raw.groupby('chap')
chapter = rawgrp.agg(sum) # with string columns only, sum concatenates the strings; students unfamiliar with pandas should review the basics
chapter = chapter[chapter.index != 0]
chapter.txt[1]

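Practice: preparing tools and materials

Install and configure the Anaconda environment for analysis on your own.

Familiarize yourself with the Jupyter Notebook environment.

Extract the text of any one chapter of Condor Heroes and do the following:

Read it into a DataFrame with one complete sentence per row, and use a second variable to record the sequence number of the paragraph each sentence belongs to.
Convert that DataFrame into a list whose members are whole paragraphs.

Notes:

The last task is mainly about Pandas operations. Students unfamiliar with the module can move straight on to the rest of the course; the gap will not affect learning text mining itself, although knowing the material is of course better.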

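Word Segmentation

Introduction to segmentation principles

Basic usage of jieba

jieba is currently the most widely used Chinese word segmentation package and is also well regarded

Installation

https://pypi.python.org/pypi/jieba/

pip install jieba

Key features

Three segmentation modes

Accurate mode: tries to cut the sentence as precisely as possible; suited to text analysis
Full mode: scans out every substring that can form a word; very fast, but cannot resolve ambiguity
Search engine mode: starts from accurate mode and cuts long words again to improve recall; suited to search engine indexing

Supports Traditional Chinese segmentation

Supports custom dictionaries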

In [ ]:

import jieba

tmpstr = "郭靖和哀牢山三十六劍。"
res = jieba.cut(tmpstr) # accurate mode
print(res) # an iterable generator: loop over it with for; unlike a list, it can be consumed only once

In [ ]:

print(' '.join(res))

In [ ]:

res = jieba.cut(tmpstr)
list(word for word in res) # demonstrates consuming the generator

In [ ]:

print(jieba.lcut(tmpstr)) # returns the result directly as a list

In [ ]:

print('/'.join(jieba.cut(tmpstr, cut_all = True))) # full mode

In [ ]:

# Search engine mode; jieba.lcut_for_search is also available
print('/'.join(jieba.cut_for_search(tmpstr)))

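Modifying the dictionary

Adding and deleting words dynamically

The in-memory dictionary can be updated dynamically in the program, based on the segmentation results

add_word(word)

word: the new word
freq=None: word frequency
tag=None: part-of-speech tag

del_word(word)

In [ ]:

# Modify the dictionary dynamically
jieba.add_word("哀牢山三十六劍")
'/'.join(jieba.cut(tmpstr))

In [ ]:

jieba.del_word("哀牢山三十六劍")
'/'.join(jieba.cut(tmpstr))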

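Using a custom dictionary

load_userdict(file_name)

file_name: a file-like object or the path to the custom dictionary

Basic dictionary format

One word per line: word, frequency (optional), part of speech (optional), separated by spaces
The dictionary file must be UTF-8 encoded
    If necessary, use Uedit32 to convert the file encoding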

云計算 5
李小福 2 nr
easy_install 3 eng
臺中
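To make the line format concrete, here is a small sketch of how such lines could be split into fields. `parse_dict_line` is a hypothetical helper written for illustration only; it is not jieba's own loader, which should still be used through `load_userdict`.

```python
def parse_dict_line(line):
    """Split 'word [freq] [tag]' into (word, freq, tag); missing fields become None."""
    parts = line.strip().split(" ")
    word = parts[0]
    freq, tag = None, None
    for field in parts[1:]:
        if field.isdigit():
            freq = int(field)   # a purely numeric field is taken as the frequency
        elif field:
            tag = field         # anything else is taken as the part-of-speech tag
    return word, freq, tag

# The sample lines from the dictionary format description
sample = ["云計算 5", "李小福 2 nr", "easy_install 3 eng", "臺中"]
entries = [parse_dict_line(line) for line in sample]
for entry in entries:
    print(entry)
```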

In [ ]:

dictpath = '金庸小說詞庫.txt' # path to the custom dictionary (named dictpath to avoid shadowing the built-in dict)
jieba.load_userdict(dictpath)
'/'.join(jieba.cut(tmpstr))

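Using Sogou cell lexicons

https://pinyin.sogou.com/dict/

Find and download the lexicon you need by browsing the lexicon categories or searching by keyword

Use a conversion tool to convert it to txt format

深藍詞庫轉換 (Shenlan lexicon converter)
奧創詞庫轉換 (Aochuang lexicon converter)

Import the resulting lexicon in your program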

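Removing stopwords

Common kinds of stopwords

Ultra-high-frequency common words: carry essentially no useful information, or are too ambiguous to be worth analyzing

的、地、得

Function words: prepositions, conjunctions, and the like

只、條、件
當、從、同

High-frequency words of a specialized domain: carry essentially no useful information

Stopwords that depend on the situation

呵呵
emoji

Removing stopwords after segmentation

Basic steps

Read in the stopword list file
Segment as usual
Remove the stopwords from the segmentation result

new_list = [ word for word in source_list if word not in stopword_list ]

Limitation of this method: a stopword is removed only if the segmentation step splits it out correctly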
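The filtering step and its limitation can be seen on a toy example. This is a sketch with made-up English tokens rather than real jieba output; holding the stopwords in a set is an assumption made here for faster membership tests, not something the course code requires.

```python
# Hypothetical token lists standing in for segmentation results.
tokens_good = ["the", "cat", "sat", "on", "the", "mat", "."]
tokens_bad = ["thecat", "sat", "on", "the", "mat", "."]  # "the" was not split off

stopwords = {"the", "on", "."}  # a set makes each membership test O(1)

def remove_stopwords(tokens, stopwords):
    """Keep only the tokens that are not in the stopword collection."""
    return [w for w in tokens if w not in stopwords]

filtered_good = remove_stopwords(tokens_good, stopwords)
filtered_bad = remove_stopwords(tokens_bad, stopwords)

print(filtered_good)
print(filtered_bad)   # "thecat" survives because it never matched "the"
```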

In [ ]:

newlist = [ w for w in jieba.cut(tmpstr) if w not in ['和', '。'] ]
print(newlist)

In [ ]:

import pandas as pd

tmpdf = pd.read_csv('停用詞.txt',
                    names = ['w'], sep = 'aaa', encoding = 'utf-8', engine = 'python')
tmpdf.head()

In [ ]:

# Those familiar with Python can read the stopword list directly with open('stopWord.txt').readlines(), which is more efficient (remember to strip the trailing newlines)
[ w for w in jieba.cut(tmpstr) if w not in list(tmpdf.w) ]

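Removing stopwords with the extract_tags function

Characteristics of this method:

Extracts feature words with the TF-IDF algorithm, dropping stopwords before extraction
A stopword dictionary can be specified manually
jieba.analyse.set_stop_words()

In [ ]:

# Use a stopword list prepared in advance
import jieba.analyse as ana

ana.set_stop_words('停用詞.txt')
jieba.lcut(tmpstr) # the loaded stopword list has no effect on plain segmentation

In [ ]:

ana.extract_tags(tmpstr, topK = 20) # extract keywords with TF-IDF, removing stopwords at the same time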

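Part-of-speech tagging

import jieba.posseg

posseg.cut(): returns the segmentation result with part-of-speech tags attached

POS tags use a notation compatible with ICTCLAS

The tags can drive further processing, such as keeping only nouns or verbs

In [ ]:

import jieba.posseg as psg

tmpres = psg.cut(tmpstr) # segmentation result with POS tags
print(tmpres)
for item in tmpres:
    print(item.word, item.flag)

In [ ]:

psg.lcut(tmpstr) # returns a list directly; its members are pair objects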

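Segmentation with NLTK

NLTK's tokenizers rely on whitespace to separate tokens, so they cannot be applied directly to Chinese text.

The usual approach is to segment with jieba first, join the tokens into space-separated text, and then pass it into the NLTK framework.

rawtext = '周伯通笑道：“你懂了嗎？...”'
txt = ' '.join(jieba.cut(rawtext)) # "周伯通 笑 道 ：..."
toke = nltk.word_tokenize(txt) # ['周伯通', '笑', '道', '：'...]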

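Practice: segmenting Condor Heroes

Take the text of Chapter 1, apply a Sogou cell lexicon and a stopword list, and produce a clean segmentation result.

Take the longest paragraph of Chapter 1 and compare the segmentation results with and without the lexicon, and with and without the stopword list.

Explore the resources on the Sogou cell lexicon site, decide which lexicons you are likely to need, then download them and convert their format.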

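Word Clouds

Word frequency statistics

Almost all word frequency tools work on a list of tokens built after segmentation, so segmentation has to be completed first.

In [ ]:

import jieba

# Segment
word_list = jieba.lcut(chapter.txt[1])
word_list[:10]

Once the list is built, you can also write your own word frequency counter along these lines:

Loop over the whole list; for each token:
if word in d:
    d[word] += 1
else:
    d[word] = 1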
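The outline above can be turned into a runnable function as follows. This is a minimal sketch on a made-up token list; `count_words` is a name chosen here for illustration, and the standard library's `collections.Counter` is shown only as a cross-check that gives the same counts.

```python
from collections import Counter

def count_words(words):
    """Count occurrences of each token with a plain dictionary."""
    d = {}
    for word in words:
        if word in d:
            d[word] += 1
        else:
            d[word] = 1
    return d

# Hypothetical segmentation result
words = ["fish", "eat", "fish", "shrimp", "eat", "fish"]
freq = count_words(words)

# Sort by descending frequency to get a ranked list
ranked = sorted(freq.items(), key=lambda kv: kv[1], reverse=True)
print(ranked)

# Counter produces the same mapping in one call
same_as_counter = (freq == dict(Counter(words)))
```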

Counting with Pandas

In [ ]:

word_list = jieba.lcut(" ".join(raw.txt))

In [ ]:

word_list[:10]

In [ ]:

import pandas as pd

df = pd.DataFrame(word_list, columns = ['word'])
df.head(30)

In [ ]:

result = df.groupby(['word']).size()
print(type(result))
freqlist = result.sort_values(ascending=False)
freqlist[:20]

In [ ]:

freqlist[freqlist.index == '道']

In [ ]:

freqlist[freqlist.index == '黃蓉道']

In [ ]:

jieba.add_word('道', freq = 50000)

Counting with NLTK

NLTK produces a frequency dictionary, which is useful when interfacing with certain packages

In [ ]:

import nltk

# Segmentation and other preprocessing
# Any preprocessing can be done here as needed: stopwords, lemma, stemming, etc.
word_list[:10]

In [ ]:

fdist = nltk.FreqDist(word_list) # build the full token frequency dictionary
fdist

In [ ]:

# Index by a word to see how many times it occurs in the whole text
fdist['顏烈']

In [ ]:

fdist.keys() # list the tokens

In [ ]:

fdist.tabulate(10)

In [ ]:

fdist.most_common(5)

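Word cloud overview

Installing the wordcloud package

Installation

Warning: installing wordcloud may go smoothly or may be painful, depending on your platform and toolchain.

Method 1: pip install wordcloud

The package may not work after installing this way; if it succeeds on the first try, count yourself lucky

Method 2: python setup.py install

Download the source package from github.com/amueller/word_cloud

Method 3: download and install a whl file compiled by a third party

https://www.lfd.uci.edu/~gohlke/pythonlibs/#wordcloud

Visual C++ build tools support

Message: Microsoft Visual C++ 14.0 is required.
Needed: Visual C++ 2015 Build Tools
File: visualcppbuildtools_full.exe

ImportError: DLL load failed: 找不到指定的模塊。 (The specified module could not be found.)

pip uninstall pillow, then reinstall the pillow package
Or, after uninstalling pillow, install with Method 2 above, which installs the supporting packages automatically

Chinese font support

.WordCloud(font_path='simhei.ttf')

The font file name must be given in full, including its path
Note that font file extensions may differ on Windows 10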

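Drawing word clouds

Basic WordCloud syntax

class wordcloud.WordCloud(

Common options:
    font_path : font used in the image; a default font is used when not specified
    width = 400 / height = 200 : width/height of the image
    max_words = 200 : maximum number of tokens to draw
    stopwords = None : stopword list; when not specified, the built-in default stopword list is used

Font settings:
    min_font_size = 4 / max_font_size = None : range of font sizes
    font_step = 1 : step by which the font size increases
    relative_scaling = .5 : how relative token frequencies translate into relative font sizes; 50% by default
    prefer_horizontal = 0.90 : proportion of tokens drawn horizontally

Color settings:
    background_color = "black" : background color of the image
    mode = "RGB" : color mode; when set to "RGBA" with background_color None, the background is transparent
    color_func = None : function that generates the color of each token; when None, a matplotlib colormap is used

Mask:
    mask = None : background image (mask) used by the word cloud

)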

Segmenting raw text directly and drawing

cloudobj = WordCloud().generate(text)

generate is in fact an alias of generate_from_text

Words in the text must be separated by spaces/punctuation, otherwise they cannot be tokenized correctly

In [ ]:

import wordcloud

myfont = r'C:\Windows\Fonts\simkai.ttf'
text = 'this is shanghai, 郭靖, 和, 哀牢山 三十六劍'
cloudobj = wordcloud.WordCloud(font_path = myfont).generate(text)
print(cloudobj)

Displaying the word cloud

import matplotlib.pyplot as plt

plt.imshow(cloudobj)

plt.close()

In [ ]:

import matplotlib.pyplot as plt

plt.imshow(cloudobj)
plt.axis("off")
plt.show()

In [ ]:

# Change the word cloud settings
cloudobj = wordcloud.WordCloud(font_path = myfont,
    width = 360, height = 180,
    mode = "RGBA", background_color = None).generate(text)
plt.imshow(cloudobj)
plt.axis("off")
plt.show()

Saving the word cloud

wordcloud.to_file(path and name of the output file): this command saves a high-resolution image

In [ ]:

cloudobj.to_file("詞云.png")
# wordcloud.WordCloud(font_path = myfont).generate(text).to_file(r"詞云.png")

Word cloud for Chapter 1 of Condor Heroes

In [ ]:

import pandas as pd
import jieba

stoplist = list(pd.read_csv('停用詞.txt', names = ['w'], sep = 'aaa',
                            encoding = 'utf-8', engine='python').w)

def m_cut(intxt):
    return [ w for w in jieba.cut(intxt) if w not in stoplist ]

In [ ]:

cloudobj = wordcloud.WordCloud(font_path = myfont,
    width = 1200, height = 800,
    mode = "RGBA", background_color = None,
    stopwords = stoplist).generate(' '.join(jieba.lcut(chapter.txt[1])))
plt.imshow(cloudobj)
plt.axis("off")
plt.show()

In [ ]:

cloudobj.to_file("詞云2.png")

Drawing from token frequencies

What generate() actually does

Calls the tokenizing function process_text()
Calls the frequency-based drawing function fit_words()

fit_words(dict)

In fact an alias of generate_from_frequencies
dict: a dictionary of tokens and their frequencies

In [ ]:

# Draw a word cloud from token frequencies
txt_freq = {'張三':100,'李四':90,'王二麻子':50}
cloudobj = wordcloud.WordCloud(font_path = myfont).fit_words(txt_freq)
plt.imshow(cloudobj)
plt.axis("off")
plt.show()

Frequency-based word cloud for Chapter 1 of Condor Heroes

In [ ]:

import nltk
from nltk import FreqDist

tokens = m_cut(chapter.txt[1])
fdist = FreqDist(tokens) # build the full token frequency dictionary
type(fdist)

In [ ]:

cloudobj = wordcloud.WordCloud(font_path = myfont).fit_words(fdist)
plt.imshow(cloudobj)
plt.axis("off")
plt.show()

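Beautifying word clouds

Detailed operations and settings are covered by the examples on the official site:

https://amueller.github.io/word_cloud/

Setting a background image

Mask

Controls the overall shape of the word cloud
Once a mask is specified, the width and height settings are ignored and the cloud takes the shape of the given image. Pure white areas are left empty and everything else is used for drawing the word cloud, so the canvas of the background image must be white (#FFFFFF)
Font size, layout, and color are also generated based on the mask
Adjust the colors when necessary to improve readability

Basic usage

from imageio import imread # scipy.misc.imread is no longer available in recent SciPy releases
mask = imread(name of the background image)

In [ ]:

from imageio import imread

def m_cut(intxt):
    return [ w for w in jieba.cut(intxt) if w not in stoplist and len(w) > 1 ]

cloudobj = wordcloud.WordCloud(font_path = myfont,
    mask = imread("射雕背景1.png"),
    mode = "RGBA", background_color = None
    ).generate(' '.join(m_cut(chapter.txt[1])))
plt.imshow(cloudobj)
plt.axis("off")
plt.show()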

Taking the color scheme from an image

Read the color settings of the chosen image

imgarray = np.array(imread(imgfilepath))

Get the image colors

bimgColors = wordcloud.ImageColorGenerator(imgarray)

Recolor the word cloud

cloudobj.recolor(color_func=bimgColors)
# Recoloring an existing word cloud object is much faster than redrawing it from scratch

In [ ]:

import numpy as np

imgobj = imread("射雕背景2.png")
image_colors = wordcloud.ImageColorGenerator(np.array(imgobj))
cloudobj.recolor(color_func=image_colors)
plt.imshow(cloudobj)
plt.axis("off")
plt.show()

Assigning colors to word groups

Ideally, word frequencies would be compared by group, with tokens that are frequent in both groups canceling each other out in the image.

The Python package currently supports only coloring tokens by group.

color_to_words = {

'#00ff00': ['顏烈', '武官', '金兵', '小人'],
'red': ['包惜弱', '郭嘯天', '楊鐵心', '丘處機']

} # '#00ff00' is the code for green

default_color = 'grey' # default color for the remaining words

cloudobj.recolor()

In [ ]:

# Grouped color function class from the official site, slightly modified
from wordcloud import get_single_color_func

class GroupedColorFunc(object):
    def __init__(self, color_to_words, default_color):
        self.color_func_to_words = [
            (get_single_color_func(color), set(words))
            for (color, words) in color_to_words.items()]
        self.default_color_func = get_single_color_func(default_color)

    def get_color_func(self, word):
        """Returns a single_color_func associated with the word"""
        try:
            color_func = next(
                color_func for (color_func, words) in self.color_func_to_words
                if word in words)
        except StopIteration:
            color_func = self.default_color_func
        return color_func

    def __call__(self, word, **kwargs):
        return self.get_color_func(word)(word, **kwargs)

# Specify the group colors
color_to_words = {
    '#00ff00': ['顏烈', '武官', '金兵', '官兵'],
    'red': ['包惜弱', '郭嘯天', '楊鐵心', '丘處機']
}
default_color = 'grey' # color for all other tokens

grouped_color_func = GroupedColorFunc(color_to_words, default_color)
cloudobj.recolor(color_func=grouped_color_func)
plt.imshow(cloudobj)
plt.axis("off")
plt.show()

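Practice: improving the Condor Heroes word cloud

Try to clean the segmentation result further and keep only names (people and places).

Hint: use POS tagging and keep only nouns and words of unknown part of speech.
     Consider optimizing the custom dictionary, for example by forcibly adjusting weights, to improve segmentation.

Draw the word cloud with all person names in shades of blue and all place names in shades of red.

Make two solid-color images, one green and one blue, use each in turn as the color source for the drawing, and observe the result.

Try different background images as masks and think about what kind of image gives the best result.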

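Vectorizing Document Information

Bag-of-words model

Bag-of-words with gensim

Installing gensim

pip install gensim

If word2vec raises errors after installation, consider installing from the MS Windows installer (exe) offered on the gensim project page: https://pypi.python.org/pypi/gensim

Building a dictionary

The Dictionary class builds the word<->id mapping: it collects all words into a set() and assigns an id to each word in the set

class gensim.corpora.dictionary.Dictionary(

documents=None : a collection of documents, each already split into words; usually a list of lists
prune_at=2000000 : maximum number of tokens in the dictionary

)

In [ ]:

from gensim.corpora import Dictionary

texts = [['human', 'interface', 'computer']]
dct = Dictionary(texts)  # fit dictionary
dct.num_nnz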

Attributes of the Dictionary class

token2id

dict of (str, int) – token -> tokenId.

id2token

dict of (int, str) – Reverse mapping for token2id, initialized in lazy manner to save memory.

dfs

dict of (int, int) – Document frequencies: token_id -> in how many documents contain this token.

num_docs

int – Number of documents processed.

num_pos

int – Total number of corpus positions (number of processed words).

num_nnz

int – Total number of non-zeroes in the BOW matrix.

In [ ]:

# Add tokens to the dictionary
dct.add_documents([["cat", "say", "meow"], ["dog"]])
dct.token2id

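Converting to BOW sparse vectors

dct.doc2bow( # convert to BOW format: list of (token_id, token_count)

document : list of tokens to convert
allow_update = False : whether to update the dictionary in place
return_missing = False : whether to also return newly seen words (those not in the dictionary)

)

Output

[(0, 2), (1, 2)] means that the tokens with ids 0 and 1 each occur twice in the document, and no other tokens occur
With return_missing = True, the output is a list of (int, int) plus a dict of (str, int)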
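To show what the sparse BOW output encodes, the following pure-Python sketch mimics the idea behind `doc2bow`. Both helpers are hypothetical illustrations, not gensim code, and the ids they assign (order of first appearance) may differ from the ids gensim assigns to the same tokens.

```python
from collections import Counter

def build_token2id(documents):
    """Assign consecutive integer ids to tokens in order of first appearance."""
    token2id = {}
    for doc in documents:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow_sketch(document, token2id):
    """Return (sorted list of (token_id, count), dict of unknown token -> count)."""
    counts = Counter(document)
    bow = sorted((token2id[t], c) for t, c in counts.items() if t in token2id)
    missing = {t: c for t, c in counts.items() if t not in token2id}
    return bow, missing

token2id = build_token2id([["cat", "say", "meow"], ["dog"]])
bow, missing = doc2bow_sketch(["this", "is", "cat", "not", "a", "dog", "cat"], token2id)
print(token2id)
print(bow)
print(missing)
```

Known tokens end up as (id, count) pairs, while tokens outside the dictionary are reported separately, which is the role `return_missing` plays in the real method.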

In [ ]:

dct.doc2bow(["this", "is", "cat", "not", "a", "dog"])

In [ ]:

dct.doc2bow(["this", "is", "cat", "not", "a", "dog"], return_missing = True)

Converting to BOW dense vectors

Possible approaches:

Convert from the sparse format yourself.
Generate the document-term matrix directly.
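The first approach, converting from the sparse format yourself, takes only a few lines. This is a minimal sketch with hypothetical helper names; it assumes the vocabulary size is known, for example from the length of the dictionary.

```python
def bow_to_dense(bow, vocab_size):
    """Expand a sparse list of (token_id, count) pairs into a dense count vector."""
    dense = [0] * vocab_size
    for token_id, count in bow:
        dense[token_id] = count
    return dense

def dense_to_bow(dense):
    """Collapse a dense count vector back into sparse (token_id, count) pairs."""
    return [(i, c) for i, c in enumerate(dense) if c != 0]

sparse = [(0, 2), (3, 1)]
dense = bow_to_dense(sparse, vocab_size=5)
roundtrip = dense_to_bow(dense)
print(dense)
print(roundtrip)
```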

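doc2idx( # convert to a list of token_id

document : list of tokens to convert
unknown_word_index = -1 : code used for tokens that are not in the dictionary

)

Output

The id of each token, listed in the order of the input list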

In [ ]:

dct.doc2idx(["this", "is", "a", "dog", "not", "cat"])

Generating the document-term matrix

With Pandas

Basic program outline:

Segment and clean the raw documents
Concatenate into a single df
Aggregate and reshape into document-term matrix format
Remove low-frequency words

In [ ]:

chapter.head()

In [ ]:

# Define the segmentation and stopword-cleaning function
# Those familiar with Python can read the stopword list with open('stopWord.txt').readlines(), which is more efficient
stoplist = list(pd.read_csv('停用詞.txt', names = ['w'], sep = 'aaa',
                            encoding = 'utf-8', engine='python').w)

import jieba

def m_cut(intxt):
    return [ w for w in jieba.cut(intxt)
            if w not in stoplist and len(w) > 1 ]

In [ ]:

# Define the DataFrame conversion function
def m_appdf(chapnum):
    tmpdf = pd.DataFrame(m_cut(chapter.txt[chapnum + 1]), columns = ['word'])
    tmpdf['chap'] = chapter.index[chapnum] # could also simply be chapnum + 1
    return tmpdf

In [ ]:

# Read everything and convert to one DataFrame
# DataFrame.append was removed in pandas 2.0, so the pieces are combined with pd.concat
df0 = pd.concat([m_appdf(i) for i in range(len(chapter))])
df0.tail()

In [ ]:

# Output as a Series
df0.groupby(['word', 'chap']).agg('size').tail(10)

In [ ]:

# Output directly as a DataFrame
t2d = pd.crosstab(df0.word, df0.chap)
len(t2d)

In [ ]:

t2d.head()

In [ ]:

# Total occurrences of each token, in preparation for dropping low-frequency words
totnum = t2d.agg(func = 'sum', axis=1)
totnum

In [ ]:

t2dclean = t2d.iloc[list(totnum >= 10)]
t2dclean.T

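With sklearn

Basic usage of the CountVectorizer class

Text is hard to feed into a model before it has been vectorized. To address this, sklearn, a library dedicated to data mining, provides a bridge from text to data mining models: the CountVectorizer class, whose functions make it straightforward to vectorize document information.

class sklearn.feature_extraction.text.CountVectorizer(

input = 'content' : {'filename', 'file', 'content'}
    'filename' means the items are names of files to be read; 'file' means they are file objects with a read method.
encoding='utf-8' : document encoding
stop_words = None : stopword list; only takes effect when analyzer == 'word'

min_df / max_df : float in range [0.0, 1.0] or int, default = 1 / 1.0
    Thresholds on document frequency, given as an absolute count or a proportion; terms outside the range are dropped
    A float is interpreted as a proportion, e.g. 0.05 means a 5% threshold

)

CountVectorizer.build_analyzer()

Returns a callable that performs text preprocessing and tokenization

In [ ]:

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer(min_df = 2) # keep only terms that occur in at least two documents
analyze = countvec.build_analyzer()
analyze('郭靖 和 哀牢山 三十六 劍 。')

CountVectorizer.fit_transform(raw_documents)

Learns from (processes) the documents and returns the document-term matrix
Equivalent to calling fit and then transform, but more efficient

In [ ]:

countvec.fit(['郭靖 和 黃蓉 哀牢山 三十六 劍 。', '黃蓉 和 郭靖 郭靖'])

In [ ]:

countvec.get_feature_names_out() # vocabulary list, i.e. the token for each column (get_feature_names() in older sklearn releases)

In [ ]:

countvec.vocabulary_ # token dictionary

In [ ]:

x = countvec.transform(['郭靖 和 黃蓉 哀牢山 三十六 劍 。', '黃蓉 和 郭靖 郭靖'])
type(x)

In [ ]:

x.todense() # convert the sparse matrix directly to a standard dense matrix

In [ ]:

countvec.fit_transform(['郭靖 和 哀牢山 三十六 劍 。', '黃蓉 和 郭靖 郭靖']) # all in one step

Using sklearn to build the chapter document-term (d2m) matrix for Condor Heroes

Convert the chapter DataFrame into text with space-separated tokens

Use fit_transform to generate the sparse BOW matrix

Convert to a standard dense d2m matrix

In [ ]:

rawchap = [ " ".join(m_cut(w)) for w in chapter.txt.iloc[:5]]
rawchap[0]

In [ ]:

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer(min_df = 5) # keep only terms that occur in at least 5 chapters
res = countvec.fit_transform(rawchap)
res

In [ ]:

res.todense()

In [ ]:

countvec.get_feature_names_out()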

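From bag-of-words to N-grams

Distributed representation of document information

What is a distributed representation

Co-occurrence matrix

The NNLM model

The CBOW model

Practice: generating word vectors

Try writing a program that:

Reads Chapter 1 of Condor Heroes paragraph by paragraph.
Generates a BOW sparse vector for each paragraph.
Updates the dictionary dynamically while generating the sparse vectors.

Write your own routines for converting between BOW sparse vectors and standard dense vectors.

Consider where BOW-based models and models based on distributed representation vectors each apply in text mining, and what their strengths and weaknesses are.

The document-term matrix contains many tokens such as “黃蓉道” and “黃蓉說”; think about ways to handle them.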

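Keyword Extraction

Basic approach to keyword extraction

The TF-IDF algorithm

Implementing TF-IDF

jieba, NLTK, sklearn, gensim, and other packages can all compute TF-IDF. Apart from differences in algorithmic details, they differ mostly in their input/output formats.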
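Before turning to the packages, it may help to see the calculation itself. The sketch below uses one common textbook variant, tf = count / document length and idf = log(N / df), on made-up documents. This is an assumption for illustration: each library applies its own smoothing and normalization, so its numbers will generally differ from these.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one dict (term -> TF-IDF weight) per document.

    tf  = count of the term in the document / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({term: (count / total) * math.log(n_docs / df[term])
                        for term, count in counts.items()})
    return weights

# Hypothetical segmented documents
docs = [["a", "b", "a"], ["a", "c"]]
weights = tfidf(docs)
for w in weights:
    print(w)
```

The term "a" occurs in every document, so its idf is log(1) = 0 and it receives no weight, which is exactly why very common words drop out of keyword lists.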

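jieba

Results are sorted by TF-IDF value in descending order automatically, and tokens are returned rather than dictionary ids, which makes them easy to read and use.

Segmentation can be done as part of the TF-IDF computation, with stopword lists and custom lexicons applied, which is very convenient.

A default IDF corpus is included, so computation can proceed without training a model.

Analysis is performed one text at a time.

jieba.analyse.extract_tags(

sentence : the text to extract from
topK = 20 : how many keywords with the highest TF-IDF weights to return
withWeight = False : whether to return the keyword weights as well
allowPOS = () : include only words with the given parts of speech; empty by default, i.e. no filtering

)

jieba.analyse.set_idf_path(file_name)

Use a custom inverse document frequency (IDF) corpus for keyword extraction

勞動防護 13.900677652
生化學 13.900677652
奧薩貝爾 13.900677652
考察隊員 13.900677652

jieba.analyse.set_stop_words(file_name)

Use a custom stop words corpus for keyword extraction

jieba.analyse.TFIDF(idf_path = None)

Create a new TFIDF model instance
idf_path : load an existing IDF frequency file (i.e. an existing model)
Extract keywords with the instance: TFIDF_instance.extract_tags()

In [ ]:

import jieba
import jieba.analyse

# Note: this call analyzes with the default TFIDF model!
jieba.analyse.extract_tags(chapter.txt[1])

In [ ]:

jieba.analyse.extract_tags(chapter.txt[1], withWeight = True) # also return the weights

In [ ]:

# Apply a custom dictionary to improve segmentation
jieba.load_userdict('金庸小說詞庫.txt') # the argument is the path to the custom dictionary
# Apply the stopword list directly in the TF-IDF computation
jieba.analyse.set_stop_words('停用詞.txt')
TFres = jieba.analyse.extract_tags(chapter.txt[1], withWeight = True)
TFres[:10]

In [ ]:

# Use a custom IDF frequency file
jieba.analyse.set_idf_path("idf.txt.big")
TFres1 = jieba.analyse.extract_tags(chapter.txt[1], withWeight = True)
TFres1[:10]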

sklearn

The output is a matrix that feeds directly into subsequent sklearn modeling.

The model must first be trained on a background corpus.

Results are given as dictionary ids rather than tokens, so they are hard to read directly.

class sklearn.feature_extraction.text.TfidfTransformer()

The parameters rarely need changing, so they are not covered here.

In [ ]:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

txtlist = [ " ".join(m_cut(w)) for w in chapter.txt.iloc[:5]]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(txtlist) # convert the tokens in the text into a term frequency matrix

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(X) # compute TF-IDF values from the term frequency matrix X
tfidf

In [ ]:

tfidf.toarray() # convert to an array

In [ ]:

tfidf.todense() # convert to a matrix

In [ ]:

tfidf.todense().shape

In [ ]:

print("Dictionary size:", len(vectorizer.vocabulary_))
vectorizer.vocabulary_

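gensim

The output is a list, likewise intended for subsequent modeling.

The model must first be trained on a background corpus.

Results are given as dictionary ids rather than tokens, so they are hard to read directly.

Earlier gensim releases also shipped an sklearn-style wrapper, sklearn_api.tfidf, for use inside sklearn; the sklearn_api module is no longer included as of gensim 4.

In [ ]:

# Segment and preprocess the documents
chaplist = [m_cut(w) for w in chapter.txt.iloc[:5]]
chaplist

In [ ]:

from gensim import corpora, models

# Build the dictionary and BOW sparse vectors for the documents
dictionary = corpora.Dictionary(chaplist)
corpus = [dictionary.doc2bow(text) for text in chaplist] # still a list of lists
corpus

In [ ]:

tfidf_model = models.TfidfModel(corpus) # build the TF-IDF model
corpus_tfidf = tfidf_model[corpus] # compute TF-IDF for the required documents
corpus_tfidf

In [ ]:

corpus_tfidf[3] # TF-IDF result for the chosen document

In [ ]:

dictionary.token2id # list the dictionary contents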

The TextRank algorithm

TextRank with jieba

jieba.analyse.textrank(

sentence, topK=20, withWeight=False,
allowPOS=('ns', 'n', 'vn', 'v')

) # note the default part-of-speech filter
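For intuition about what such a function computes, here is a simplified TextRank sketch in plain Python. It is not jieba's implementation: the window size, damping factor, iteration count, and the `textrank_sketch` name are all assumptions made for illustration, and no part-of-speech filtering is applied. Tokens that co-occur within a sliding window are linked in an undirected weighted graph, and a PageRank-style iteration then scores each token.

```python
from collections import defaultdict

def textrank_sketch(tokens, window=5, damping=0.85, iterations=30):
    """Rank tokens by a PageRank-style score on their co-occurrence graph."""
    # Build an undirected weighted graph from co-occurrences inside the window.
    graph = defaultdict(lambda: defaultdict(float))
    for i, w in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            u = tokens[j]
            if u == w:
                continue
            graph[w][u] += 1.0
            graph[u][w] += 1.0

    # Iterate the ranking update a fixed number of times.
    scores = {w: 1.0 for w in graph}
    for _ in range(iterations):
        new_scores = {}
        for w in graph:
            rank = 0.0
            for u, weight in graph[w].items():
                out_sum = sum(graph[u].values())
                rank += weight / out_sum * scores[u]
            new_scores[w] = (1 - damping) + damping * rank
        scores = new_scores
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Hypothetical token sequence in which "x" co-occurs with every other token
tokens = ["a", "x", "b", "x", "c", "x", "d", "x"]
ranked = textrank_sketch(tokens, window=2)
print(ranked)
```

Because "x" is linked to every other token, it collects score from all of them and comes out on top.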

In [ ]:

jieba.analyse.textrank(chapter.txt[1], topK=20, withWeight = True)

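Practice exercises

Use the full text of Condor Heroes to compute an IDF corpus for jieba, then recompute the keywords of Chapter 1 with it. Compare the result with the earlier one.

Write a program that converts jieba's TF-IDF results into document-term matrix format.

Think about the use cases for each of the three TF-IDF implementations covered in this chapter.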

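Extracting Document Topics

Basic concepts of topic models

sklearn implementation

In scikit-learn, the LDA topic model is provided by the sklearn.decomposition.LatentDirichletAllocation class. Its implementation is based on the variational-inference EM algorithm rather than on MCMC with Gibbs sampling.

Note that because LDA is built on term counts, TF-IDF is in theory not a suitable document feature for it, although nothing prevents trying it. Such usage does appear in practice.

class sklearn.decomposition.LatentDirichletAllocation(

n_components = 10 : number of latent topics K, the most important parameter to set.
    The appropriate range for K depends on the research context.
    The larger K is, the more document samples are needed.
doc_topic_prior = None : parameter α of the Dirichlet prior on document-topic distributions; 1/K if unset.
topic_word_prior = None : parameter η of the Dirichlet prior on topic-word distributions; 1/K if unset.

learning_method = 'batch' : the LDA fitting algorithm. 'batch' | 'online'
    batch: variational-inference EM that uses all training samples in each update of the topic-word distribution; the default in recent versions.
        When the sample is small and the goal is just learning, batch is preferable since there are far fewer parameters to tune.
        Pay attention to n_components (K), doc_topic_prior (α), topic_word_prior (η)
    online: online variational-inference EM that updates the topic-word distribution from mini-batches of training samples; the first choice for large samples.
        Additionally pay attention to learning_decay, learning_offset,
            total_samples, and batch_size.

Parameters that only need setting for the online algorithm
    learning_decay = 0.7 : controls the learning rate of the "online" algorithm; usually left unchanged.
        Best kept within (0.5, 1.0] to guarantee asymptotic convergence of the "online" algorithm.
    learning_offset = 10. : reduces the influence of the early training batches on the final model.
        Should be greater than 1.
    total_samples = 1e6 : total number of documents in the corpus.
        Only needed when fitting the model with partial_fit.
    batch_size = 128 : number of document samples used in each EM iteration.

)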

Converting the corpus into the required matrix

Besides converting the segmented and cleaned text directly, you can first compute keyword TF-IDF values and then use the keyword matrix for subsequent analysis.

In [ ]:

# Define the segmentation and stopword-cleaning function
# Those familiar with Python can read the stopword list with open('stopWord.txt').readlines(), which is more efficient
stoplist = list(pd.read_csv('停用詞.txt', names = ['w'], sep = 'aaa',
                            encoding = 'utf-8', engine='python').w)

import jieba

def m_cut(intxt):
    return [ w for w in jieba.cut(intxt)
            if w not in stoplist and len(w) > 1 ]

In [ ]:

# Generate the segmented and cleaned chapter texts
cleanchap = [ " ".join(m_cut(w)) for w in chapter.txt]

In [ ]:

# Convert the tokens in the text into a term frequency matrix
from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer(min_df = 5)
wordmtx = countvec.fit_transform(cleanchap)
wordmtx

In [ ]:

# Compute TF-IDF values from the term frequency matrix
from sklearn.feature_extraction.text import TfidfTransformer

transformer = TfidfTransformer()
tfidf = transformer.fit_transform(wordmtx)
tfidf

In [ ]:

# Set up the LDA model
from sklearn.decomposition import LatentDirichletAllocation

n_topics = 10
ldamodel = LatentDirichletAllocation(n_components = n_topics)

In [ ]:

# Fit the LDA model; note that the raw wordmtx count matrix is used here
ldamodel.fit(wordmtx)

In [ ]:

# What the fitted model actually contains
print(ldamodel.components_.shape)
ldamodel.components_[:2]

In [ ]:

# Function for printing the top words of each topic
def print_top_words(model, feature_names, n_top_words):
    for topic_idx, topic in enumerate(model.components_):
        print("Topic #%d:" % topic_idx)
        print(" ".join([feature_names[i]
                        for i in topic.argsort()[:-n_top_words - 1:-1]]))
    print()

In [ ]:

n_top_words = 12
tf_feature_names = countvec.get_feature_names_out() # get_feature_names() in older sklearn releases
print_top_words(ldamodel, tf_feature_names, n_top_words)

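gensim implementation

class gensim.models.ldamodel.LdaModel(

corpus = None : corpus used to train the model
num_topics = 100 : number of topics to extract
id2word = None : the token dictionary in use, which makes results easier to read
passes = 1 : number of passes over the corpus; more passes give a more accurate model but take longer

)

Updating the model with newly arrived documents

ldamodel.update(other_corpus)

Earlier gensim releases also shipped an sklearn-style wrapper, sklearn_api.ldamodel, for use inside sklearn; the sklearn_api module is no longer included as of gensim 4.

In [ ]:

# Define the segmentation and stopword-cleaning function
# Those familiar with Python can read the stopword list with open('stopWord.txt').readlines(), which is more efficient
stoplist = list(pd.read_csv('停用詞.txt', names = ['w'], sep = 'aaa',
                            encoding = 'utf-8', engine='python').w)

import jieba

def m_cut(intxt):
    return [ w for w in jieba.cut(intxt)
            if w not in stoplist and len(w) > 1 ]

In [ ]:

# Preprocess the documents and extract the tokens
chaplist = [m_cut(w) for w in chapter.txt]

In [ ]:

# Build the dictionary and BOW sparse vectors for the documents
from gensim import corpora, models

dictionary = corpora.Dictionary(chaplist)
corpus = [dictionary.doc2bow(text) for text in chaplist] # still a list of lists
tfidf_model = models.TfidfModel(corpus) # build the TF-IDF model
corpus_tfidf = tfidf_model[corpus] # compute TF-IDF for the required documents
corpus_tfidf

In [ ]:

from gensim.models.ldamodel import LdaModel

# Record the elapsed time for reference
%time ldamodel1 = LdaModel(corpus, id2word = dictionary, \
                          num_topics = 10, passes = 2)

Listing the most important topics

print_topics(num_topics=20, num_words=10)

In [ ]:

ldamodel1.print_topics()

In [ ]:

# Compute the LDA representation of each document
corpus_lda = ldamodel1[corpus] # use the same kind of matrix as in training; the model above was trained on the BOW corpus
for doc in corpus_lda:
    print(doc)

In [ ]:

ldamodel1.get_topics()

In [ ]:

# Find the topics closest to a text
query = chapter.txt[1] # topics closest to Chapter 1
query_bow = dictionary.doc2bow(m_cut(query)) # count vector
query_tfidf = tfidf_model[query_bow] # TF-IDF vector
print("After conversion:", query_tfidf[:10])
ldamodel1.get_document_topics(query_bow) # takes the BOW vector of the document

In [ ]:

# Query with the TF-IDF vector instead; the model was trained on BOW counts, so treat this result as a comparison only
ldamodel1[query_tfidf]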

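Visualizing the results

Questions a presentation of document topics needs to answer:

What does each topic mean?
How important is each topic? Is it a major one?
How are the topics related to one another?

The pyLDAvis package, ported from R, presents topic model results as interactive graphics.

It supports both sklearn and gensim.

Compatibility problems occur under many system configurations.

During installation, pip may start by downloading the newest release and then step down through older ones on compatibility errors until it finds a suitable one

pip install pyLDAvis

How pyLDAvis presents results:

Left panel: the relationships between topics in model space, and their importance.
    Positions are computed with MDS.
    Circle size represents the prevalence of the topic (importance in terms of frequency).
Right panel: the tokens most strongly associated with the currently selected topic.

Tuning the lambda parameter:

1 : importance is determined entirely by how frequent the token is within the topic
0 : importance is determined entirely by the token's lift
    lift: frequency of the token within the topic / frequency of the token in the whole corpus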
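The effect of lambda can be checked numerically. The sketch below uses the relevance formula from the LDAvis method, lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w)), where the ratio inside the second logarithm is the lift. The two tokens and their probabilities are made up for illustration.

```python
import math

def relevance(p_w_given_t, p_w, lam):
    """Relevance of a token to a topic: a mix of in-topic probability and lift."""
    lift = p_w_given_t / p_w
    return lam * math.log(p_w_given_t) + (1 - lam) * math.log(lift)

# Hypothetical tokens: (probability within the topic, overall probability)
common = (0.10, 0.08)     # frequent everywhere, lift = 1.25
specific = (0.02, 0.002)  # rarer, but concentrated in this topic, lift = 10

for lam in (1.0, 0.6, 0.0):
    print(lam, relevance(*common, lam), relevance(*specific, lam))

top_at_1 = "common" if relevance(*common, 1.0) > relevance(*specific, 1.0) else "specific"
top_at_0 = "common" if relevance(*common, 0.0) > relevance(*specific, 0.0) else "specific"
```

At lambda = 1 the generally frequent token ranks first; at lambda = 0 the topic-specific one does, which matches the two extremes described above.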

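pyLDAvis.sklearn.prepare(

lda_model : a Latent Dirichlet Allocation model trained with sklearn on dtm
dtm : the document-term matrix used to train lda_model
vectorizer : the vectorizer used to convert the raw documents into dtm

) # returns: a data structure for visualization

pyLDAvis.gensim.prepare() follows the same pattern, taking the gensim model, the BOW corpus, and the dictionary

Note: in recent pyLDAvis releases these modules may be named pyLDAvis.lda_model and pyLDAvis.gensim_models; adjust the imports below if the names used here are not found.

In [ ]:

# Visualize the sklearn LDA result
import pyLDAvis
import pyLDAvis.sklearn

pyLDAvis.enable_notebook()

In [ ]:

pyLDAvis.sklearn.prepare(ldamodel, wordmtx, countvec)

In [ ]:

# Visualize the gensim LDA result
import pyLDAvis.gensim

pyLDAvis.enable_notebook()

In [ ]:

pyLDAvis.gensim.prepare(ldamodel1, corpus, dictionary)

In [ ]:

pyLDAvis.disable_notebook() # with notebook support turned off, the underlying generated data becomes visible

Practice exercises

With all other parameters held fixed, fit LDA models on the matrix before cleaning, the raw count matrix after cleaning, and the TF-IDF matrix, and compare the results.

When fitting LDA with gensim, set passes to 1, 5, 10, 50, 100, and so on, observe how the results change, and think about how to choose the best value.

Try to optimize the model to obtain a good result for this case.

Hint: fitting with gensim is somewhat easier.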

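Document Similarity

Basic concepts

Token similarity: word2vec

The bag-of-words model ignores relationships between tokens, so it cannot be used to compute token similarity.

Distributed representations take the context of each token into account and can therefore capture the relatedness present in that context; the similarity between tokens can then be computed directly from this information.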
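Once tokens are represented as vectors, similarity is usually measured as the cosine of the angle between them. The following is a minimal sketch with made-up three-dimensional vectors; real word2vec vectors have hundreds of dimensions, but the calculation is the same.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical word vectors
vec_a = [1.0, 2.0, 3.0]
vec_b = [2.0, 4.0, 6.0]   # same direction as vec_a
vec_c = [3.0, 0.0, -1.0]  # orthogonal to vec_a

sim_ab = cosine_similarity(vec_a, vec_b)
sim_ac = cosine_similarity(vec_a, vec_c)
print(sim_ab, sim_ac)
```

Vectors pointing in the same direction score 1 regardless of their length, and orthogonal vectors score 0.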

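At present the algorithm is mainly run through gensim.

Earlier gensim releases also shipped an sklearn-style wrapper, sklearn_api.w2vmodel, for use inside sklearn; the sklearn_api module is no longer included as of gensim 4.

Setting up the word2vec model

class gensim.models.word2vec.Word2Vec(

sentences = None : a list-of-lists-like format; for very large texts, prefer streaming
vector_size = 100 : dimensionality of the word vectors; with enough data, 300/500 works better
    called size in older versions
window = 5 : context window size
workers = 3 : number of worker threads; speeds up training noticeably on multi-core systems

Other detailed settings:
    min_count = 5 : low-frequency threshold; words occurring fewer times are excluded from the model
    max_vocab_size = None : every 10 million tokens need about 1 GB of memory; set this when necessary to save memory
    sample = 0.001 : threshold for randomly downsampling high-frequency words
    negative = 5 : number of noise words drawn in negative sampling, usually 5-20; 0 disables negative sampling
    epochs = 5 : number of iterations over the corpus; called iter in older versions

Parameters related to the neural network model:
    seed=1, alpha=0.025, min_alpha=0.0001, sg=0, hs=0

)

In [ ]:

chapter.head()

In [ ]:

# Segment and preprocess into list-of-lists format
import jieba

chapter['cut'] = chapter.txt.apply(jieba.lcut)
chapter.head()

In [ ]:

# Initialize the word2vec model and vocabulary
from gensim.models.word2vec import Word2Vec

n_dim = 300 # vector dimensionality; 300-500 works well for large samples
w2vmodel = Word2Vec(vector_size = n_dim, min_count = 10)
w2vmodel.build_vocab(chapter.cut) # build the vocabulary
w2vmodel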

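Training the word2vec model

word2vecmodel.train(

sentences : iterable of iterables; for very large texts, prefer streaming
total_examples = None : number of sentences, int; model.corpus_count can be passed directly
total_words = None : total number of tokens in the sentences, int; at least one of this and total_examples must be given
epochs = None : number of training iterations; must be specified

Other parameters with default values:
    start_alpha=None, end_alpha=None, word_count=0, queue_factor=2,
    report_delay=1.0, compute_loss=False, callbacks=()

)

In [ ]:

# Train on the chapter corpus (may take a few minutes on large datasets)
# This example uses little memory
%time w2vmodel.train(chapter.cut, \
               total_examples = w2vmodel.corpus_count, epochs = 10)

In [ ]:

# What the trained model actually contains
print(w2vmodel.wv["郭靖"].shape)
w2vmodel.wv["郭靖"]

Saving and reusing the w2v model

w2vmodel.save(path and file name)
w2vmodel = Word2Vec.load(path and file name)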

Similarity between word vectors

w2vmodel.wv.most_similar(token)

In [ ]:

w2vmodel.wv.most_similar("郭靖")

In [ ]:

w2vmodel.wv.most_similar("黃蓉", topn = 20)

In [ ]:

w2vmodel.wv.most_similar("黃蓉道")

In [ ]:

# Look for analogies
w2vmodel.wv.most_similar(['郭靖', '小紅馬'], ['黃藥師'], topn = 5)

In [ ]:

w2vmodel.wv.most_similar(positive=['郭靖', '黃蓉'], negative=['楊康'], topn=10)

In [ ]:

# Similarity/relatedness of two words
print(w2vmodel.wv.similarity("郭靖", "黃蓉"))
print(w2vmodel.wv.similarity("郭靖", "楊康"))
print(w2vmodel.wv.similarity("郭靖", "楊鐵心"))

In [ ]:

# Find the word that does not belong
w2vmodel.wv.doesnt_match("小紅馬 黃藥師 魯有腳".split())

In [ ]:

w2vmodel.wv.doesnt_match("楊鐵心 黃藥師 黃蓉 洪七公".split())

Document similarity

Based on the bag-of-words model

sklearn implementation

sklearn.metrics.pairwise.pairwise_distances(

X : array used to compute distances
    [n_samples_a, n_samples_a] if metric == 'precomputed'
    [n_samples_a, n_features] otherwise
Y = None : second array for computing distances; usable when metric != 'precomputed'

metric = 'euclidean' : distance measure
    Supported natively by scikit-learn : ['cityblock', 'cosine', 'euclidean', 
        'l1', 'l2', 'manhattan'], which accept sparse matrices directly
    From scipy.spatial.distance : ['braycurtis', 'canberra', 
        'chebyshev', 'correlation', 'dice', 'hamming', 'jaccard',
        'kulsinski', 'mahalanobis', 'matching', 'minkowski',
        'rogerstanimoto', 'russellrao', 'seuclidean', 'sokalmichener',
        'sokalsneath', 'sqeuclidean', 'yule'], which do not support sparse matrices

n_jobs = 1 : number of parallel jobs; with -1, all CPU cores are used

)

In [ ]:

cleanchap = [ " ".join(m_cut(w)) for w in chapter.txt.iloc[:5]]

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
resmtx = countvec.fit_transform(cleanchap)
resmtx

In [ ]:

from sklearn.metrics.pairwise import pairwise_distances

pairwise_distances(resmtx, metric = 'cosine')

In [ ]:

pairwise_distances(resmtx) # the default metric is euclidean

In [ ]:

# Compute similarity from the TF-IDF matrix
pairwise_distances(tfidf[:5], metric = 'cosine')

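gensim implementation

In [ ]:

from gensim import similarities

simmtx = similarities.MatrixSimilarity(corpus)
simmtx

Cosine similarity based on LDA

Information required:

The fitted LDA model
The text to be queried, converted to the same kind of matrix used when fitting the model
    The text to query
    The dictionary used in modeling

In [ ]:

# Find the chapters most similar to Chapter 1 (sharing the same topics)
# The index is built in LDA topic space so that it can be queried with an LDA vector
simmtx = similarities.MatrixSimilarity(ldamodel1[corpus], num_features = ldamodel1.num_topics)
simmtx

In [ ]:

simmtx.index[:2]

In [ ]:

# Demonstration using the gensim LDA fit
query = chapter.txt[1]
query_bow = dictionary.doc2bow(m_cut(query))
lda_vec = ldamodel1[query_bow] # convert to a vector under the LDA model
sims = simmtx[lda_vec] # cosine similarity between the indexed vectors and the supplied vector
sims = sorted(enumerate(sims), key=lambda item: -item[1])
sims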

doc2vec?

word2vec用來計算詞條相似度非常合適。

較短的文檔如果希望計算文本相似度，可以將各自內部的word2vec向量分別進行平均，用平均后的向量作為文本向量，從而用于計算相似度。

但是對于長文檔，這種平均的方式顯然過于粗糙。

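doc2vec is an extension of word2vec that directly produces vector representations of sentences/paragraphs/documents, so the similarity between them can then be obtained by computing distances.

Model overview

Goal: obtain a fixed-length vector representation of each document.
Data: multiple documents together with their tags; titles generally serve well as tags. 
Factors affecting model accuracy: corpus size and number of documents (the more, the better); similarity among the documents (the more similar, the better).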

In [ ]:

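import jieba
import gensim
from gensim.models import doc2vec

def m_doc(doclist):
    # Wrap each segmented document in a TaggedDocument, using its position as the tag
    reslist = []
    for i, doc in enumerate(doclist):
        reslist.append(doc2vec.TaggedDocument(jieba.lcut(doc), [i]))
    return reslist

corp = m_doc(chapter.txt)

In [ ]:

corp[:2]

In [ ]:

d2vmodel = gensim.models.Doc2Vec(vector_size = 300, 
                window = 20, min_count = 5)
%time d2vmodel.build_vocab(corp)
# build_vocab only creates the vocabulary; train the vectors before querying them
d2vmodel.train(corp, total_examples = d2vmodel.corpus_count, epochs = d2vmodel.epochs)

In [ ]:

# The vocab attribute was removed from KeyedVectors in Gensim 4.0.0.
d2vmodel.wv.key_to_index

In [ ]:

# Convert a new text into a vector in the model's vector space
newvec = d2vmodel.infer_vector(jieba.lcut(chapter.txt[1]))

In [ ]:

# Gensim 4 exposes the document vectors as dv (formerly docvecs)
d2vmodel.dv.most_similar([newvec], topn = 10)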

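Document clustering

Once document similarities have been computed, document clustering is in essence no different from ordinary cluster analysis.

Note: the most commonly used algorithm, K-means, works with squared Euclidean distance, which may well not give the best results for text clustering.

The speed of the algorithm matters as much as the quality of its results.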
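One common way to soften the distance issue noted above is to scale every document vector to unit length first: for unit vectors, squared Euclidean distance equals 2 - 2 × cosine similarity, so K-means on normalized rows behaves much like clustering by cosine distance. TfidfTransformer applies L2 normalization by default (norm='l2'). The sketch below checks the identity numerically with made-up vectors; the helper names are illustrative.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Two hypothetical document vectors
a = [3.0, 0.0, 4.0]
b = [1.0, 2.0, 2.0]

lhs = squared_euclidean(l2_normalize(a), l2_normalize(b))
rhs = 2 - 2 * cos_sim(a, b)
print(lhs, rhs)  # the two values agree up to floating-point error
```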

In [ ]:

# Label each chapter with its title
chapter.index = [raw.txt[raw.chap == i].iloc[0] for i in chapter.index]
chapter.head()

In [ ]:

import jieba

cuttxt = lambda x: " ".join(m_cut(x))
cleanchap = chapter.txt.apply(cuttxt)
cleanchap[:2]

In [ ]:

# Compute the TF-IDF matrix
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

vectorizer = CountVectorizer()
wordmtx = vectorizer.fit_transform(cleanchap) # convert the words in the text into a term-frequency matrix
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(wordmtx) # compute TF-IDF values from the term-frequency matrix
tfidf

In [ ]:

# Run the cluster analysis
from sklearn.cluster import KMeans

clf = KMeans(n_clusters = 5)
s = clf.fit(tfidf)
print(s)
clf.cluster_centers_

In [ ]:

clf.cluster_centers_.shape

In [ ]:

clf.labels_

In [ ]:

chapter['clsres'] = clf.labels_
chapter.head()

In [ ]:

chapter.sort_values('clsres').clsres

In [ ]:

chapgrp = chapter.groupby('clsres')
chapcls = chapgrp.agg("sum") # with only string columns, sum concatenates the strings
cuttxt = lambda x: " ".join(m_cut(x))
chapclsres = chapcls.txt.apply(cuttxt)
chapclsres

In [ ]:

# List keywords that characterise each cluster
import jieba.analyse as ana

ana.set_stop_words('停用詞.txt')
for item in chapclsres:
    print(ana.extract_tags(item, topK = 10))

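Hands-on exercises

Clean out stop words before computing term similarity, then refit the model, and consider why the results come out the way they do.

When computing cosine similarity between texts from raw term frequencies under the bag-of-words model, compare the results before and after removing stop words.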

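Document classification

Overview of document classification methods

The naive Bayes algorithm

sklearn implementation

sklearn is a standard toolkit for data mining and modelling. Once the corpus has been converted into a d2m (document-term) matrix, every standard data mining technique in sklearn can be applied to it.

sklearn also implements the naive Bayes algorithm, and it is used in much the same way as the other models.

Generating the d2m matrix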

In [ ]:

# Extract the paragraphs of the first two chapters from the raw corpus df
raw12 = raw[raw.chap.isin([1,2])]
raw12ana = raw12[raw12.txt.apply(len) > 50].copy() # keep only paragraphs longer than 50 characters
raw12ana.reset_index(drop = True, inplace = True)
print(len(raw12ana))
raw12ana.head()

In [ ]:

# Word segmentation and preprocessing
import jieba

cuttxt = lambda x: " ".join(jieba.lcut(x)) # no cleaning is done here, so that sentiment words are kept
raw12ana["cleantxt"] = raw12ana.txt.apply(cuttxt)
raw12ana.head()

In [ ]:

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer()
wordmtx = countvec.fit_transform(raw12ana.cleantxt)
wordmtx

Splitting into training and test sets

In [ ]:

# Split the data set into a training set and a test set
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(wordmtx, raw12ana.chap, 
    test_size = 0.3, random_state = 111)

Fitting the naive Bayes model

In [ ]:

from sklearn import naive_bayes

NBmodel = naive_bayes.MultinomialNB()

In [ ]:

# Fit the model
NBmodel.fit(x_train, y_train)

In [ ]:

# Predict on the held-out set
x_test

In [ ]:

NBmodel.predict(x_test)

Model evaluation

In [ ]:

# Prediction accuracy (scoring the model)
print('training set:', NBmodel.score(x_train, y_train), 
      ', held-out set:', NBmodel.score(x_test, y_test))

In [ ]:

from sklearn.metrics import classification_report

print(classification_report(y_test, NBmodel.predict(x_test)))
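The per-class figures in such a classification report are precision, recall and F1. As a rough sketch of what those numbers mean, the following computes them by hand for one class; the labels are invented for illustration, and the function is a simplified stand-in rather than sklearn's implementation.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Precision, recall and F1 for the class `positive`."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical chapter labels (1 or 2) for eight paragraphs
y_true = [1, 1, 1, 2, 2, 2, 2, 1]
y_pred = [1, 1, 2, 2, 2, 1, 2, 1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, positive=1)
print(precision, recall, f1)
```

Precision answers "of the paragraphs predicted as chapter 1, how many really are", while recall answers "of the true chapter 1 paragraphs, how many were found".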

Classifying with a logistic regression model

In [ ]:

from sklearn.linear_model import LogisticRegression

logitmodel = LogisticRegression() # define the logistic regression model

In [ ]:

# Fit the model
logitmodel.fit(x_train, y_train)
print(classification_report(y_test, logitmodel.predict(x_test)))

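Model prediction

Convert the text to be predicted into a d2m matrix whose format matches exactly the one used for modelling; prediction can then proceed.

In [ ]:

countvec.vocabulary_

In [ ]:

string = "楊鐵心和包惜弱收養穆念慈"
words = " ".join(jieba.lcut(string))
words_vecs = countvec.transform([words]) # the input must be an iterable such as a list
words_vecs

In [ ]:

NBmodel.predict(words_vecs)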
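The reason new text must go through the already fitted vectorizer is that the column layout is fixed at fitting time, and words not seen during fitting have no column. The following is a simplified pure-Python sketch of that idea; fit_vocabulary and transform are hypothetical helpers, not the CountVectorizer API, and unlike CountVectorizer they neither sort the vocabulary nor drop single-character tokens.

```python
def fit_vocabulary(docs):
    """Map each distinct token in the training documents to a column index."""
    vocab = {}
    for doc in docs:
        for token in doc.split():
            vocab.setdefault(token, len(vocab))
    return vocab

def transform(doc, vocab):
    """Count tokens of `doc` into the columns defined by `vocab`."""
    row = [0] * len(vocab)
    for token in doc.split():
        if token in vocab:  # tokens unseen during fitting are silently ignored
            row[vocab[token]] += 1
    return row

# Made-up, already segmented training documents
train_docs = ["楊鐵心 包惜弱 牛家村", "郭嘯天 李萍 牛家村"]
vocab = fit_vocabulary(train_docs)

new_row = transform("楊鐵心 收養 穆念慈 牛家村 牛家村", vocab)
print(vocab)
print(new_row)  # same number of columns as the training matrix
```

Calling fit_transform again on the new text would instead build a different vocabulary, and the model's coefficients would no longer line up with the columns.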

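NLTK implementation

NLTK ships with a naive Bayes implementation that can be used directly for document classification.

Format of the corpus in the data set

Each training text must be a dictionary built from the segmented text: the terms are the keys, and the values may be numbers, strings, or True/False.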

{'張三' : True, '李四' : True, '王五' : False}
{'張三' : 1, '李四' : 1, '王五' : 0}
{'張三' : '有', '李四' : '有', '王五' : '無'}

In [ ]:

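# Convert with a Pandas command
freqlist.to_dict()

In [ ]:

df0.groupby(['word']).agg('size').tail(10).to_dict()

Format of the training data set

The training data set is a list of lists, where each member is a list of the form [corpus dictionary, outcome label].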

[
[{'張三' : 1, '李四' : 1, '王五' : 0}, '合格'],
[{'張三' : 0, '李四' : 1, '王五' : 0}, '不合格']
]
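A structure of this shape can be assembled from segmented texts with the standard library alone. The sketch below is one possible way to do it; the token lists, labels and the helper to_features are made up for illustration.

```python
from collections import Counter

def to_features(tokens):
    """Turn a list of tokens into a {term: count} feature dictionary."""
    return dict(Counter(tokens))

# Hypothetical segmented texts with their outcome labels
labelled_docs = [
    (["郭嘯天", "楊鐵心", "楊鐵心"], "chap1"),
    (["丘處機", "包惜弱"], "chap2"),
]

# One [feature dictionary, label] pair per document
training_data = [[to_features(tokens), label] for tokens, label in labelled_docs]
print(training_data[0])
```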

Building the model

Because of the risk of overfitting, the data should be split into training and test sets beforehand.

model = NaiveBayesClassifier.train(training_data)

In [ ]:

# Each chapter is treated as one unit here to keep the program simple
import nltk
from nltk import FreqDist

# Build the full term-frequency dictionaries; a loop would work just as well
fdist1 = FreqDist(m_cut(chapter.txt[1]))
fdist2 = FreqDist(m_cut(chapter.txt[2]))
fdist3 = FreqDist(m_cut(chapter.txt[3]))
fdist1

In [ ]:

from nltk.classify import NaiveBayesClassifier

training_data = [ [fdist1, 'chap1'], [fdist2, 'chap2'], [fdist3, 'chap3'] ]

In [ ]:

# Train the classifier
NLTKmodel = NaiveBayesClassifier.train(training_data)

In [ ]:

print(NLTKmodel.classify(FreqDist(m_cut("楊鐵心收養穆念慈"))))
print(NLTKmodel.classify(FreqDist(m_cut("錢塘江 日日夜夜 包惜弱 顏烈 使出楊家槍"))))

Examining model fit

In [ ]:

nltk.classify.accuracy(NLTKmodel, training_data) # accuracy, measured here on the training data itself

In [ ]:

NLTKmodel.show_most_informative_features(5) # shows likelihood ratios, i.e. which features are most useful

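Homework

Extract keywords from the first two chapters of 《射雕》 and use the keywords rather than the raw text for document classification; compare how classification performance differs between the two approaches.

Reduce the size of the training sample and observe how model performance changes with naive Bayes or other standard classification algorithms.

Hint: students comfortable with programming can write a loop that produces the curve of model performance against sample size automatically.

Write your own NLTK-based program that classifies chapters using paragraphs as the unit.

Download another wuxia novel by 金庸 (Jin Yong) or 古龍 (Gu Long) and build a model that classifies any paragraph as belonging either to that novel or to 《射雕》.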

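Sentiment analysis

Overview of sentiment analysis

Analysis based on the bag-of-words model

About the data:

Roughly 10,000 positive and 10,000 negative reviews scraped from shopping websites.
They cover several domains, including electronics, books and food.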

In [ ]:

# Read in the raw data set
import pandas as pd

dfpos = pd.read_excel("購物評論.xlsx", sheet_name = "正向", header=None)
dfpos['y'] = 1
dfneg = pd.read_excel("購物評論.xlsx", sheet_name = "負向", header=None)
dfneg['y'] = 0
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df0 = pd.concat([dfpos, dfneg], ignore_index = True)
df0.head()

In [ ]:

# Word segmentation and preprocessing
import jieba

cuttxt = lambda x: " ".join(jieba.lcut(x)) # no cleaning is done here, so that sentiment words are kept
df0["cleantxt"] = df0[0].apply(cuttxt)
df0.head()

In [ ]:

from sklearn.feature_extraction.text import CountVectorizer

countvec = CountVectorizer(min_df = 5) # keep only terms that appear in at least 5 reviews
wordmtx = countvec.fit_transform(df0.cleantxt)
wordmtx

In [ ]:

# Build training and test sets in a 7:3 ratio
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    wordmtx, df0.y, test_size=0.3) # sparse matrices can be used directly here
x_train[0]

In [ ]:

# Model with an SVM
from sklearn.svm import SVC

clf = SVC(kernel = 'rbf', verbose = True)
clf.fit(x_train, y_train) # memory usage may be fairly high
clf.score(x_train, y_train)

In [ ]:

# Evaluate the model
from sklearn.metrics import classification_report

print(classification_report(y_test, clf.predict(x_test)))

In [ ]:

clf.predict(countvec.transform([df0.cleantxt[0]]))[0]

In [ ]:

# Model prediction
import jieba

def m_pred(string, countvec, model):
    words = " ".join(jieba.lcut(string))
    words_vecs = countvec.transform([words]) # the input must be an iterable
    result = model.predict(words_vecs)
    if int(result[0]) == 1:
        print(string, ": positive")
    else:
        print(string, ": negative")

comment = "外觀美觀，速度也不錯。上面一排觸摸鍵挺實用。應該對得起這個價格。當然再降點大家肯定也不反對。風扇噪音也不大。"
m_pred(comment, countvec, clf)

In [ ]:

comment = "作為女兒6.1的禮物。雖然晚到了幾天。等拿到的時候，女兒愛不釋手，上洗手間也看，告知不好。竟以學習毛主席來反駁我。我反對了幾句，還說我對主席不敬。暈。上周末，告訴我她把火鞋和風鞋拿到學校，好多同學羨慕她。呵呵，我也看了其中的人鴉，只可惜沒有看完就在老公的催促下睡了。說了這么多，歸納為一句：這套書買的值。"
m_pred(comment, countvec, clf)

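Analysis based on distributed representations

Compared with the bag-of-words model, distributed representations mainly change how information is extracted from the text.

At present the relevant algorithms are mostly implemented with gensim.

Note: the resulting matrix no longer holds frequency counts (its values are real numbers and may be negative), so a count-based naive Bayes model such as the multinomial one used earlier cannot be fitted to it.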

In [ ]:

# Read in the raw data set, exactly as above
import pandas as pd

dfpos = pd.read_excel("購物評論.xlsx", sheet_name = "正向", header=None)
dfpos['y'] = 1
dfneg = pd.read_excel("購物評論.xlsx", sheet_name = "負向", header=None)
dfneg['y'] = 0
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df0 = pd.concat([dfpos, dfneg], ignore_index = True)
df0.head()

In [ ]:

# Word segmentation and preprocessing, producing a list of lists
import jieba

df0['cut'] = df0[0].apply(jieba.lcut)
df0.head()

In [ ]:

# Build training and test sets in a 7:3 ratio
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    df0.cut, df0.y, test_size=0.3)
x_train[:2]

Setting up the word2vec model

In [ ]:

# Initialise the word2vec model and its vocabulary
from gensim.models.word2vec import Word2Vec

n_dim = 300 # vector dimension; 300 to 500 works well for large samples
w2vmodel = Word2Vec(vector_size = n_dim, min_count = 10)
w2vmodel.build_vocab(x_train) # build the vocabulary

In [ ]:

# Train on the review training set (may take several minutes on large data sets)
# This example uses little memory
%time w2vmodel.train(x_train, total_examples = w2vmodel.corpus_count, epochs = 10)

In [ ]:

# Similarity between sentiment word vectors
w2vmodel.wv.most_similar("不錯")

In [ ]:

w2vmodel.wv.most_similar("失望")

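Building whole-sentence vectors for sentiment prediction

For short texts such as shopping reviews and microblog posts, the usual approach is to feed the average of all word vectors to the classifier.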
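One practical detail with averaging is what to do when none of a text's words are in the model's vocabulary (for example because of min_count): the average is then undefined and can surface as missing values in the modelling matrix. A minimal sketch of one possible guard is shown below; average_or_zero and the 2-dimensional toy vectors are assumptions for illustration, and falling back to a zero vector is just one reasonable choice.

```python
def average_or_zero(words, vectors, dim):
    """Average the known word vectors; fall back to a zero vector if none are known."""
    found = [vectors[w] for w in words if w in vectors]
    if not found:
        return [0.0] * dim
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

# Hypothetical 2-dimensional vectors for two sentiment words
toy = {"不錯": [1.0, 0.0], "失望": [-1.0, 0.5]}

mixed = average_or_zero(["不錯", "失望", "未知詞"], toy, dim=2)
unknown = average_or_zero(["未知詞"], toy, dim=2)
print(mixed)    # average over the two known words only
print(unknown)  # zero vector instead of an undefined average
```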

In [ ]:

# Build the matrix of word vectors for every term in one sentence
pd.DataFrame([w2vmodel.wv[w] for w in df0.cut[0] if w in w2vmodel.wv]).head()

In [ ]:

# Build the sentence vector as the plain average of its word vectors
def m_avgvec(words, w2vmodel):
    return pd.DataFrame([w2vmodel.wv[w] 
                  for w in words if w in w2vmodel.wv]).agg("mean")

In [ ]:

# Build the modelling matrix; this takes a while
%time train_vecs = pd.DataFrame([m_avgvec(s, w2vmodel) for s in x_train])
train_vecs.head()

Fitting the sentiment analysis model

In [ ]:

# Fit an SVM to the converted matrix
from sklearn.svm import SVC

clf2 = SVC(kernel = 'rbf', verbose = True)
clf2.fit(train_vecs, y_train) # uses less than 1 GB of memory
clf2.score(train_vecs, y_train)

In [ ]:

from sklearn.metrics import classification_report

print(classification_report(y_train, clf2.predict(train_vecs))) # the held-out set is not used here

In [ ]:

# Save the trained model for later use
# sklearn dropped its bundled joblib after version 0.23; install the joblib package and import it directly
import joblib
# joblib.dump(modelname, 'filename')
# modelname = joblib.load('filename')

In [ ]:

# Model prediction
import jieba

def m_pred(string, model):
    words = jieba.lcut(string)
    words_vecs = pd.DataFrame(m_avgvec(words, w2vmodel)).T
    result = model.predict(words_vecs)
    if int(result[0]) == 1:
        print(string, ": positive")
    else:
        print(string, ": negative")

comment = "作為女兒6.1的禮物。雖然晚到了幾天。等拿到的時候，女兒愛不釋手，上洗手間也看，告知不好。竟以學習毛主席來反駁我。我反對了幾句，還說我對主席不敬。暈。上周末，告訴我她把火鞋和風鞋拿到學校，好多同學羨慕她。呵呵，我也看了其中的人鴉，只可惜沒有看完就在老公的催促下睡了。說了這么多，歸納為一句：這套書買的值。"
m_pred(comment, clf2)

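Homework

Write your own sentiment analysis program based on a sentiment lexicon and compare its prediction accuracy with that of the other methods.

Hint: the 《知網》 (HowNet) sentiment word list can serve as the lexicon.

Try using keywords for the bag-of-words sentiment analysis and assess how much the results improve.

In the model based on distributed representations, apply cleaning steps such as stop-word removal and compare model performance before and after.

From the data used in this chapter, draw 1,000 positive and 1,000 negative reviews, refit both the bag-of-words model and the distributed-representation model, and compare how the performance of the two models changes.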

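Automatic document summarization

Basic principles of automatic summarization

Evaluating automatic summaries

Python implementation of automatic summarization

In [ ]:

chapter.txt[1]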

In [ ]:

def cut_sentence(intxt):
    # Split text into sentences at Chinese end-of-sentence punctuation
    delimiters = frozenset('。！？')
    buf = []
    for ch in intxt:
        buf.append(ch)
        if ch in delimiters:
            yield ''.join(buf)
            buf = []
    if buf:
        yield ''.join(buf)

In [ ]:

sentdf = pd.DataFrame(cut_sentence(chapter.txt[1]))
sentdf

In [ ]:

# Record sentence lengths so that very short sentences can be dropped,
# which keeps meaningless content out of the summary
sentdf['txtlen'] = sentdf[0].apply(len)
sentdf.head()

In [ ]:

sentlist = sentdf[0][sentdf.txtlen > 20]
print(len(sentlist))
sentlist

In [ ]:

import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

txtlist = [" ".join(jieba.lcut(w)) for w in sentlist]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(txtlist) # convert the words in the text into a term-frequency matrix

In [ ]:

tfidf_matrix = TfidfTransformer().fit_transform(X)

In [ ]:

# Run the PageRank algorithm with the networkx package
import networkx as nx

# from_scipy_sparse_array needs networkx 2.7 or later; older releases call it from_scipy_sparse_matrix
similarity = nx.from_scipy_sparse_array(tfidf_matrix * tfidf_matrix.T)
scores = nx.pagerank(similarity)

In [ ]:

scores

In [ ]:

tops = sorted(scores.items(), key = lambda x: x[1], reverse = True)

In [ ]:

tops[:3]

In [ ]:

print(sentlist.iloc[tops[0][0]])
print(sentlist.iloc[tops[1][0]])
sentlist.iloc[tops[2][0]]

In [ ]:

topn = 5
topsent = sorted(tops[:topn]) # put the selected sentences back in their original order
abstract = ''
for item in topsent:
    abstract = abstract + sentlist.iloc[item[0]] + "......"
abstract[:-6]
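To make the ranking step less of a black box, here is a small pure-Python sketch of weighted PageRank by power iteration over a sentence-similarity matrix. It is a simplified illustration, not the networkx implementation: the 3×3 similarity values are invented, the damping factor 0.85 is the conventional default, and the function assumes every sentence has at least one non-zero similarity.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Weighted PageRank by power iteration on a square similarity matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sums = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for j in range(n):
            # Each sentence i passes on its score in proportion to its similarity to j
            incoming = sum(
                scores[i] * weights[i][j] / out_sums[i]
                for i in range(n) if out_sums[i] > 0
            )
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    return scores

# Hypothetical pairwise similarities between three sentences (diagonal left at zero)
similarity = [
    [0.0, 0.6, 0.5],
    [0.6, 0.0, 0.1],
    [0.5, 0.1, 0.0],
]

scores = pagerank(similarity)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
print(scores)
print(ranking)  # sentence 0 is most similar to the others, so it ranks first
```

A sentence scores highly when it is similar to many other sentences that themselves score highly, which is why the top-ranked sentences tend to be representative of the chapter.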

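Homework

Try writing your own program that uses measures such as TextRank and TF-IDF to extract sentences and generate a summary.

Try generating the summary from paragraphs rather than sentences.

Hint: for long paragraphs, consider extracting key sentences from within the paragraph and using them in the summary in place of the whole paragraph.

Consider how automatic summarization and topic extraction are alike and how they differ.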

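Automatic writing

Basic principles of automatic writing

Application scenarios

Basic principles of RNNs

Basic principles of LSTMs

English writing with an LSTM

Predicting English text can be reduced to predicting the next character. The resulting model is fairly simple, which makes it a convenient way to demonstrate basic LSTM usage.

Data source: plain-text txt files downloaded from the Project Gutenberg website https://www.gutenberg.org/wiki/Category:Bookshelf

Sample data: r&j.txt

Note: at the time of writing, Keras supports Python only up to version 3.6 and may fail to run in later environments.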

Text preprocessing

In [ ]:

# Load the required packages
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint
from keras.utils import to_categorical # np_utils is missing from recent Keras releases

In [ ]:

# The separator 'aaaaa' is not expected to occur in the text, so each line is read as one field
rawtxt = pd.read_csv("r&j.txt", sep = 'aaaaa', 
                     names = ['txt'], engine = 'python')
print(rawtxt.head())
rawtxt.txt[1]

In [ ]:

# Lower-case the text and append a space to each line
def m_perproc(tmpstr):
    return (tmpstr + " ").lower()

rawtxt.txt = rawtxt.txt.apply(m_perproc)
rawtxt.txt[1]

In [ ]:

raw_txt = rawtxt.txt.agg("sum")
raw_txt

In [ ]:

# Convert characters to numeric codes for processing
chars = sorted(list(set(raw_txt))) # list of distinct characters
char_to_int = dict((c, i) for i, c in enumerate(chars)) # character-to-number dictionary
int_to_char = dict((i, c) for i, c in enumerate(chars)) # number-to-character dictionary
chars

Building the training and test sets

In [ ]:

seq_length = 100
x = []; y = []
for i in range(0, len(raw_txt) - seq_length):
    given = raw_txt[i:i + seq_length] # the preceding seq_length characters are the predictors
    predict = raw_txt[i + seq_length] # the character that follows is the target
    x.append([char_to_int[char] for char in given])
    y.append(char_to_int[predict])

In [ ]:

x[:3]

In [ ]:

y[:3]

Convert the numeric representation of the text into the array format the LSTM expects: [samples, time steps, features]

In [ ]:

n_patterns = len(x)
n_vocab = len(chars)
# Reshape x into the format the LSTM needs; the trailing 1 means each value is
# a vector of its own (one character of input)
x = np.reshape(x, (n_patterns, seq_length, 1))
x = x / float(n_vocab) # scale to values between 0 and 1 to ease computation
x[0]

In [ ]:

# Declare the target variable as categorical (one-hot encoded)
y = to_categorical(y)
y[0]
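The shapes produced by these steps are easier to see on a tiny input. The following pure-Python sketch repeats the windowing, scaling and one-hot encoding on a five-character string with a window of 3; the helpers make_windows and one_hot are illustrative stand-ins for the loop above and for Keras's one-hot conversion, not library functions.

```python
def make_windows(text, seq_length, char_to_int):
    """Slide a window over the text: seq_length characters in, the next one out."""
    x, y = [], []
    for i in range(len(text) - seq_length):
        x.append([char_to_int[c] for c in text[i:i + seq_length]])
        y.append(char_to_int[text[i + seq_length]])
    return x, y

def one_hot(index, n_classes):
    """Encode a class index as a 0/1 vector of length n_classes."""
    row = [0] * n_classes
    row[index] = 1
    return row

text = "abcab"
chars = sorted(set(text))
char_to_int = {c: i for i, c in enumerate(chars)}

x, y = make_windows(text, 3, char_to_int)
# Nested lists of shape [samples, time steps, features] with one feature per step
x_scaled = [[[code / len(chars)] for code in window] for window in x]
y_onehot = [one_hot(code, len(chars)) for code in y]

print(x)         # windows 'abc' and 'bca' as integer codes
print(y_onehot)  # targets 'a' and 'b' as one-hot rows
```

Each sample is therefore a sequence of seq_length single-value steps, and each target is a probability-style vector with one slot per distinct character, which matches the softmax output layer used next.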

Building the LSTM model

In [ ]:

model = Sequential()
# LSTM layer with 128 units
model.add(LSTM(128, input_shape = (x.shape[1], x.shape[2])))
model.add(Dropout(0.2)) # randomly drop 20% of the outputs to reduce overfitting
model.add(Dense(y.shape[1], activation = 'softmax')) # standard fully connected output layer
# Specify the loss function
model.compile(loss = 'categorical_crossentropy', optimizer = 'adam')

In [ ]:

# batch_size feeds the data to training in batches, reducing the computing resources needed
# More epochs generally fit the training data better, but training time grows linearly
model.fit(x, y, epochs = 2, batch_size = 64)

Predicting text

In [ ]:

def predict_next(input_array): # predict the next character
    x = np.reshape([0 for i in range(seq_length - len(input_array))]
                   + input_array, (1, seq_length, 1)) # build the x sequence used for prediction
    x = x / float(n_vocab)
    y = model.predict(x)
    return y

def string_to_index(raw_input): # convert the input characters to index values
    res = []
    for c in raw_input[-seq_length:]: # keep at most the last seq_length characters
        res.append(char_to_int[c])
    return res

def y_to_char(y): # convert the prediction from an index value back to a character
    largest_index = y.argmax() # index of the largest value
    c = int_to_char[largest_index]
    return c

In [ ]:

def generate_article(init, rounds = 50): # predict the given number of characters
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_char(predict_next(string_to_index(in_string)))
        in_string += n # append the predicted character for use in the next step
    return in_string

In [ ]:

# Predict characters
init = 'We produce about two million dollars for each hour we work. The fifty hours is one conservative estimate for how long'
article = generate_article(init)
article

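Combining LSTM with word2vec for automatic writing in Chinese

Character-level prediction cannot exploit how letters combine into words, and a letter is not a unit of meaning in its own right, so a character-based prediction model will clearly perform rather poorly.

Word-level prediction, on the other hand, has to deal with the sparse vectors that result from the enormous number of distinct words.

word2vec condenses the useful information in those sparse vectors, which makes the computation for a word-level prediction model feasible.

Even so, in the vast majority of cases the computation required for this kind of automatic writing is more than an ordinary PC can handle.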

Text preprocessing

In [ ]:

# Load the required packages
import jieba
from gensim.models.word2vec import Word2Vec
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Dropout
from keras.layers import LSTM
from keras.callbacks import ModelCheckpoint

In [ ]:

dictfile = '金庸小說詞庫.txt' # path to the custom dictionary (renamed so the built-in dict is not shadowed)
jieba.load_userdict(dictfile)
# Whole sentences or whole paragraphs are clearly the more suitable unit of analysis
corpus = [jieba.lcut(item) for item in raw.txt]
corpus[:3]

Convert the text to word2vec vectors. The longer the vectors, the longer the subsequent training takes.

In [ ]:

# An external corpus could perfectly well be used here for more thorough training
w2v_model = Word2Vec(corpus, vector_size = 100, window = 5, 
                     min_count = 5, workers = 4)

In [ ]:

w2v_model.wv['郭嘯天']

In [ ]:

# Flatten the data back into one long list
raw_input = [item for sublist in corpus for item in sublist]
print(len(raw_input))
raw_input[:10]

In [ ]:

# List the terms included in the model
vocab = w2v_model.wv.index_to_key
vocab

In [ ]:

# min_count = 5 filters out low-frequency words, so they must be removed from the text as well
vocab_set = set(vocab) # a set makes the membership test fast
text_stream = []
for word in raw_input:
    if word in vocab_set:
        text_stream.append(word)
print(len(text_stream))
text_stream[:10]

Building the training and test sets

In [ ]:

seq_length = 10 # use the preceding 10 words for prediction
x = []; y = []
for i in range(0, len(text_stream) - seq_length):
    given = text_stream[i : i + seq_length]
    predict = text_stream[i + seq_length]
    x.append(np.array([w2v_model.wv[word] for word in given]))
    y.append(w2v_model.wv[predict])

In [ ]:

len(x)

In [ ]:

x[0][0]

In [ ]:

y[0]

Next, convert the w2v numeric representation into the format the LSTM expects: [samples, time steps, features]

In [ ]:

x = np.reshape(x, (-1, seq_length, 100)) # each term corresponds to one word2vec vector
y = np.reshape(y, (-1, 100))

Building the LSTM model

In [ ]:

model = Sequential()
model.add(LSTM(128, input_shape = (seq_length, 100)))
model.add(Dropout(0.2))
# Note: sigmoid restricts outputs to (0, 1), whereas word2vec components can be negative
model.add(Dense(100, activation = 'sigmoid'))
model.compile(loss = 'mse', optimizer = 'adam')

In [ ]:

model.fit(x, y, epochs = 5, batch_size = 64)

In [ ]:

model.summary()

In [ ]:

model.save_weights('LSTM.h5') # the file type is HDF5

In [ ]:

model.load_weights('LSTM.h5')

In [ ]:

model.fit(x, y, epochs = 10) # continue training with the given data and parameters

Continuing training to find the best model

In [ ]:

from keras.callbacks import ModelCheckpoint

checkpointer = ModelCheckpoint(filepath = 'LSTM_best.hdf5', 
                               monitor = 'val_loss', 
                               save_best_only = True, 
                               verbose = 1)

In [ ]:

model.compile(loss = 'mse', optimizer = 'adam')

In [ ]:

# Note: the validation data here is the training data itself, so val_loss
# does not measure performance on unseen text
model.fit(x, y, epochs = 50, 
          validation_data = (x, y), 
          callbacks = [checkpointer])

Predicting text

In [ ]:

def predict_next(input_array):
    x = np.reshape(input_array, (-1, seq_length, 100))
    y = model.predict(x)
    return y

def string_to_index(raw_input):
    # Segment the input and keep only words known to the word2vec model
    input_stream = []
    for word in jieba.lcut(raw_input):
        if word in vocab:
            input_stream.append(word)
    res = []
    for word in input_stream[-seq_length:]: # keep the last seq_length words
        res.append(w2v_model.wv[word])
    return res

def y_to_word(y):
    # Map the predicted vector back to the closest word in the vocabulary
    word = w2v_model.wv.most_similar(positive = y, topn = 1)
    return word

In [ ]:

def generate_article(init, rounds = 50):
    in_string = init.lower()
    for i in range(rounds):
        n = y_to_word(predict_next(string_to_index(in_string)))
        in_string += n[0][0]
    return in_string

In [ ]:

init = '郭嘯天、楊鐵心越聽越怒。郭嘯天道：“靖康年間徽欽二帝被金兵擄去這件大恥，我們'
article = generate_article(init)
print(article)
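The last step of word-level generation, turning a predicted vector back into a word, amounts to a nearest-neighbour lookup by cosine similarity. The sketch below shows that idea in plain Python with invented 2-dimensional vectors; nearest_word is a hypothetical helper that mimics the effect of the most_similar call above, not gensim's implementation.

```python
import math

def cos_sim(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def nearest_word(target, vectors):
    """Return the word whose vector has the highest cosine similarity to `target`."""
    return max(vectors, key=lambda w: cos_sim(target, vectors[w]))

# Hypothetical word vectors; real ones would come from the trained word2vec model
toy = {
    "郭靖":   [1.0, 0.0],
    "黃蓉":   [0.0, 1.0],
    "歐陽鋒": [-1.0, 0.0],
}

print(nearest_word([0.9, 0.2], toy))  # the prediction points mostly along 郭靖's direction
print(nearest_word([0.1, 0.7], toy))
```

Because the network's output rarely coincides with any word vector exactly, the generated text is only as good as this nearest-word approximation.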

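Homework

If GPU computing is available to you, try installing and configuring the GPU version of TensorFlow.

Try cutting the raw text down to only the paragraphs that mention 郭靖 or 黃蓉, train on those, and then automatically write dialogue between 郭靖 and 黃蓉.

Keeping the other parameters essentially unchanged, switch from training by paragraph to training by whole sentence and compare the results.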

