How to Count Word Frequencies in Python

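Hey everyone 👋, today let's talk about a super practical skill: counting word frequencies with Python! It's not just a treat for programming enthusiasts; it's a real gem for anyone doing data analysis, text mining, or even everyday writing. 🔍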

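Why count word frequencies?

When working with large amounts of text, we often want to know which words appear most often. That helps us pin down the core topic of a text, or spot trending topics on social media. 📊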

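Getting ready

Before you start, make sure Python is installed on your computer; if it isn't, you can download it from the official Python website. We'll also use nltk, a very powerful library for natural language processing. If you haven't installed it yet, run pip install nltk. The plotting and TF-IDF steps further down also rely on matplotlib and scikit-learn, which you can install with pip install matplotlib scikit-learn.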

Reading the text

We need a text file to analyze. Assuming you already have one, Python's built-in open function is all it takes to read it.

with open('your_text_file.txt', 'r', encoding='utf-8') as file:
    text = file.read()

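Tokenizing

Next, we split the text into individual words. nltk ships a very handy tokenizer for this. Keep in mind that word_tokenize targets English-style text; languages written without spaces between words, such as Chinese, need a dedicated segmenter instead.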

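import nltk
from nltk.tokenize import word_tokenize

# Make sure nltk's tokenizer data is downloaded
# (newer nltk releases may ask for 'punkt_tab' instead of 'punkt')
nltk.download('punkt')

# Tokenize
words = word_tokenize(text)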
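
If you'd rather try things out before installing nltk, here's a minimal sketch of a regex-based tokenizer that uses only the standard library. The helper name simple_tokenize is made up for this post, and it is not equivalent to word_tokenize (which, for example, splits "Don't" into two tokens); treat it as a rough stand-in for simple English text.

```python
import re

def simple_tokenize(text):
    """Split text into lowercase word tokens with a regular expression.

    Hypothetical helper, not part of nltk: it keeps runs of letters and
    apostrophes, so "don't" stays a single token.
    """
    return re.findall(r"[a-z']+", text.lower())

sample = "Don't panic: the cat sat on the mat."
tokens = simple_tokenize(sample)
print(tokens)
# ["don't", 'panic', 'the', 'cat', 'sat', 'on', 'the', 'mat']
```

Because it only matches ASCII letters, this sketch drops digits and accented characters; that is a deliberate simplification, not a recommendation.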

Counting word frequencies

We can use the Counter class from Python's collections module to count how many times each word appears.

from collections import Counter

# Count how often each word appears
word_counts = Counter(words)
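
To get a feel for what Counter gives you back, here's a tiny example on a hand-made word list (the sample words are made up for illustration). A Counter behaves like a dictionary that maps each word to its count.

```python
from collections import Counter

sample_words = ["apple", "pie", "apple", "tart", "pie", "apple"]
sample_counts = Counter(sample_words)

print(sample_counts["apple"])       # 3
print(sample_counts["banana"])      # 0, missing words do not raise KeyError
print(len(sample_counts))           # 3 distinct words
print(sum(sample_counts.values()))  # 6 words in total
```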

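Cleaning the data

Before counting, we usually want to clean the data: remove stop words (common words that add little to the analysis, such as "the" and "is") and convert every word to lowercase, so that "Python" and "python" are counted as the same word. Once the list is clean, we rebuild word_counts from it.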

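from nltk.corpus import stopwords

# Lowercase everything and keep only purely alphabetic tokens
# (this drops punctuation and numbers)
words = [word.lower() for word in words if word.isalpha()]

# Remove stop words
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
filtered_words = [word for word in words if word not in stop_words]

# Recount using the cleaned words
word_counts = Counter(filtered_words)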
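
The cleaning steps can also be wrapped into one small function. The sketch below is self-contained and runs without nltk: DEMO_STOP_WORDS is a tiny hand-written list and clean_and_count is a hypothetical helper name, both assumptions made for illustration. In real use you would pass in nltk's much longer stop word list.

```python
from collections import Counter

# Tiny illustrative stop word list (an assumption for this sketch);
# nltk's English list is much longer.
DEMO_STOP_WORDS = {"the", "is", "a", "on", "and"}

def clean_and_count(tokens, stop_words=DEMO_STOP_WORDS):
    """Lowercase tokens, drop non-alphabetic ones and stop words, then count."""
    lowered = (token.lower() for token in tokens)
    kept = [t for t in lowered if t.isalpha() and t not in stop_words]
    return Counter(kept)

demo_tokens = ["The", "cat", "is", "on", "the", "mat", ".", "The", "CAT", "naps", "!"]
demo_counts = clean_and_count(demo_tokens)
print(demo_counts)
# Counter({'cat': 2, 'mat': 1, 'naps': 1})
```

Lowercasing before the stop word check matters here: "The" would slip past the filter if it were compared in its original case.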

Checking the results

Now we can look at the words that appear most often.

# Show the 10 most common words as (word, count) pairs
most_common_words = word_counts.most_common(10)
print(most_common_words)
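
Raw counts are hard to compare between a short text and a long one, so it can help to turn them into shares of the total. Here's a minimal sketch of that idea; relative_frequencies is a hypothetical helper name and the demo counts are made up.

```python
from collections import Counter

def relative_frequencies(counts, n=10):
    """Return the n most common words with their share of all counted words."""
    total = sum(counts.values())
    if total == 0:
        return []  # nothing to report for an empty Counter
    return [(word, count / total) for word, count in counts.most_common(n)]

demo = Counter({"data": 5, "python": 3, "text": 2})
for word, share in relative_frequencies(demo):
    print(f"{word}: {share:.1%}")
# data: 50.0%
# python: 30.0%
# text: 20.0%
```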

Visualizing

To make the results easier to read, we can draw a bar chart of the word frequencies with the matplotlib library.

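import matplotlib.pyplot as plt

# Draw a bar chart of the word frequencies
labels, counts = zip(*most_common_words)
plt.figure(figsize=(10, 8))
plt.bar(labels, counts)
plt.xticks(rotation=45)
plt.tight_layout()  # keep the rotated labels from being cut off
plt.show()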

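Going further

If you want to dig deeper into your text, consider TF-IDF (term frequency-inverse document frequency), which measures how important a word is to one document within a collection or corpus. The sklearn library provides a very convenient TF-IDF vectorizer.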

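from sklearn.feature_extraction.text import TfidfVectorizer

# Create the TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')

# Only one document is passed here, so every term gets the same IDF;
# pass a list with several documents to get meaningful weights
tfidf_matrix = vectorizer.fit_transform([' '.join(filtered_words)])

# Inspect the result: (number of documents, number of distinct terms)
print(tfidf_matrix.shape)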
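
To see what TF-IDF actually computes, here's a rough pure-Python sketch of the textbook formula, tf × log(N / df), on three made-up mini documents. The tf_idf helper is hypothetical, and this is not how scikit-learn does it internally: TfidfVectorizer applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: score} dict per document."""
    n_docs = len(documents)

    # Document frequency: in how many documents each term appears
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # term frequency times inverse document frequency
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores

docs = [
    ["python", "counts", "words"],
    ["python", "plots", "charts"],
    ["python", "reads", "files"],
]
all_scores = tf_idf(docs)
for doc_scores in all_scores:
    print(doc_scores)
```

Notice that "python" scores 0.0 everywhere: it appears in every document, so it tells us nothing about what makes any one of them special. Words unique to a single document get the highest scores.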