Google-BERT-on-fake_or_real-news-dataset

Description: Google BERT applied to the fake_or_real news dataset, achieving a best F1 score of 0.986.

Showcase

1. Pipeline

(Figure: pipeline overview)

First, we have the raw text, which consists of a title, the body text, and a label. We then apply several data-processing steps to the text. After processing, the data is fed into the BERT model for training; the model consists of BERT itself plus a classifier, for which I used a feed-forward neural network followed by a softmax layer to normalize the output. In the end, we obtain the predictions and other details.
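
To make the pipeline concrete, here is a minimal conceptual sketch. `classify` is a hypothetical helper written only for illustration; it reuses the `cleanup` function, `tokenizer`, `model`, and `F` exactly as they are defined in the Implementation section below.

```python
# Conceptual sketch only: raw title + text -> cleanup -> WordPiece ids -> BERT + classifier -> softmax
def classify(title, text):
    combined = title + '. ' + text                  # combine title and body text (Section 2.2)
    combined = cleanup(combined)                    # regex-based removal of non-sentences (Section 2.3)
    tokens = tokenizer.tokenize(combined)[:256]     # WordPiece tokens, truncated to max_seq_length
    ids = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])
    logits = model(ids)                             # BERT encoder + feed-forward classifier head
    return F.softmax(logits, dim=1)                 # normalized probabilities for REAL / FAKE
```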

2. Part1: Data processing

(1) Drop non-sentences

• Type1: http[s]://www.claritypress.com/LendmanIII.html
• Type2: [email protected]
• Type3: @EP_President #EP_President 
• Type4: **Want FOX News First * in your inbox every day? Sign up here.**
• Type5: ☮️ 💚 🌍 etc

(2) EDA methods

• Insert word by BERT similarity (Random Insertion)
• Substitute word by BERT similarity (Synonym Replacement)

As for the first part, I use two methods: dropping non-sentences and applying EDA methods. Reading through some of the articles in the fake_or_real news dataset, I found that they contain various types of non-sentence content, so I use regular expressions to drop it (a simplified illustration follows below). I then use random insertion and synonym replacement to augment the text.
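
For example, URLs, @-handles, and hashtags can be dropped with regular expressions like the following (a simplified illustration; the exact patterns used in this project are in Section 2.3):

```python
import re

sample = "Read more at http://www.example.com/article @EP_President #EP_President today"
sample = re.sub(r"http[s]?://\S+", "", sample)   # drop URLs (simplified pattern)
sample = re.sub(r"[@#]\S+\s?", "", sample)       # drop @handles and #hashtags (simplified pattern)
print(sample)                                    # the URL, handle, and hashtag are gone
```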

3. Part2: Bert Model

(Figure: BERT model architecture)

As for the second part, we feed the text obtained from the first part into the BERT model. The BERT model uses 12 encoder layers, followed by the classifier, to produce the output.
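
If you want to verify the 12 encoder layers, the pretrained `bert-base` model loaded in Section 1.3 exposes them; a quick check (assuming `model` has been built as shown in the Implementation section):

```python
# bert-base: 12 transformer encoder layers with hidden size 768
print(len(model.bert.encoder.layer))   # expected: 12
print(model.bert.config.hidden_size)   # expected: 768
```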

4. Part3: Result

(Figure: F1 scores for the different data-processing combinations)

In the end, we combine the different data-processing methods, and you can see the resulting F1 scores in the chart. We get the best F1 score (0.986) from cased text + dropping non-sentences.
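
The F1 score reported here is the usual harmonic mean of precision and recall; as a sanity check it can be reproduced with scikit-learn (the arrays below are made-up stand-ins for the validation labels and predictions collected in Section 3.2):

```python
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # illustrative labels
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]   # illustrative predictions
p = precision_score(y_true, y_pred)
r = recall_score(y_true, y_pred)
print(f1_score(y_true, y_pred), 2 * p * r / (p + r))   # both print the same F1 value
```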

5. Part4: Reference

(1) EDA:

• Knowledge: https://towardsdatascience.com/these-are-the-easiest-data-augmentation-techniques-in-natural-language-processing-you-can-think-of-88e393fd610
• Implementation: https://github.com/makcedward/nlpaug

(2) Why stopwords are not removed:

• Deeper Text Understanding for IR with Contextual Neural Language Modeling: https://arxiv.org/pdf/1905.09217
• Understanding the Behaviors of BERT in Ranking: https://arxiv.org/pdf/1904.07531

(3) Bert by Pytorch:

• https://pytorch.org/hub/huggingface_pytorch-pretrained-bert_bert/

(4) Bert Demo:

• https://github.com/sugi-chan/custom_bert_pipeline

(5) Dataset:

• https://cbmm.mit.edu/sites/default/files/publications/fake-news-paper-NIPS.pdf

I learned the EDA techniques from the first two links, and from the two papers I learned that we shouldn't remove stopwords, since removing them would destroy the context of the sentence. The remaining references cover the PyTorch implementation of BERT, the BERT demo this project builds on, and the dataset.

Implementation

1. Preparation

1.1 Set parameters, install and load the required packages

### Parameter settings
par_cased = 0    # 1 = cased BERT vocabulary, 0 = uncased (default)
par_cleanup = 1  # 1 = apply regex cleanup (default), 0 = no cleanup
par_eda = 0      # 1 = apply EDA augmentation, 0 = no EDA (default)

# install dependencies (run as a notebook/shell command)
!pip install pytorch_pretrained_bert nlpaug bert matplotlib sklearn librosa SoundFile nltk pandas

from __future__ import print_function, division
import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler
import numpy as np
import torchvision
from torchvision import datasets, models, transforms
import matplotlib.pyplot as plt
import time
import os
import copy
from torch.utils.data import Dataset, DataLoader
from PIL import Image
from random import randrange
import torch.nn.functional as F
from sklearn.metrics import roc_curve, auc
import nlpaug.augmenter.char as nac
#import nlpaug.augmenter.word as naw
import nlpaug.flow as naf
from nlpaug.util import Action

1.2 Set tokenizer

import torch
from pytorch_pretrained_bert import BertTokenizer, BertModel, BertForMaskedLM

# OPTIONAL: if you want to have more information on what's happening, activate the logger as follows
import logging
logging.basicConfig(level=logging.INFO)

# Load pre-trained model tokenizer (vocabulary)
if par_cased == 1:
    tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
else:
    tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
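
For instance, the WordPiece tokenizer splits rare words into sub-word pieces; a quick check with an arbitrary sentence (output depends on whether the cased or uncased vocabulary was loaded):

```python
tokens = tokenizer.tokenize("Hillary Clinton's emails were leaked.")
print(tokens)                                   # WordPiece sub-word tokens
print(tokenizer.convert_tokens_to_ids(tokens))  # corresponding vocabulary ids
```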

1.3 Define the BERT config and classification model

class BertLayerNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-12):
        """Construct a layernorm module in the TF style (epsilon inside the square root)."""
        super(BertLayerNorm, self).__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.bias = nn.Parameter(torch.zeros(hidden_size))
        self.variance_epsilon = eps

    def forward(self, x):
        u = x.mean(-1, keepdim=True)
        s = (x - u).pow(2).mean(-1, keepdim=True)
        x = (x - u) / torch.sqrt(s + self.variance_epsilon)
        return self.weight * x + self.bias


class BertForSequenceClassification(nn.Module):
    """BERT model for classification.
    This module is composed of the BERT model with a linear layer on top of
    the pooled output.
    Params:
        `config`: a BertConfig class instance with the configuration to build a new model.
        `num_labels`: the number of classes for the classifier. Default = 2.
    Inputs:
        `input_ids`: a torch.LongTensor of shape [batch_size, sequence_length]
            with the word token indices in the vocabulary. Items in the batch should begin with the special "CLS" token. (see the tokens preprocessing logic in the scripts
            `extract_features.py`, `run_classifier.py` and `run_squad.py`)
        `token_type_ids`: an optional torch.LongTensor of shape [batch_size, sequence_length] with the token
            types indices selected in [0, 1]. Type 0 corresponds to a `sentence A` and type 1 corresponds to
            a `sentence B` token (see BERT paper for more details).
        `attention_mask`: an optional torch.LongTensor of shape [batch_size, sequence_length] with indices
            selected in [0, 1]. It's a mask to be used if the input sequence length is smaller than the max
            input sequence length in the current batch. It's the mask that we typically use for attention when
            a batch has varying length sentences.
        `labels`: labels for the classification output: torch.LongTensor of shape [batch_size]
            with indices selected in [0, ..., num_labels].
    Outputs:
        if `labels` is not `None`:
            Outputs the CrossEntropy classification loss of the output with the labels.
        if `labels` is `None`:
            Outputs the classification logits of shape [batch_size, num_labels].
    Example usage:
    ```python
    # Already been converted into WordPiece token ids
    input_ids = torch.LongTensor([[31, 51, 99], [15, 5, 0]])
    input_mask = torch.LongTensor([[1, 1, 1], [1, 1, 0]])
    token_type_ids = torch.LongTensor([[0, 0, 1], [0, 1, 0]])
    config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
        num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)
    num_labels = 2
    model = BertForSequenceClassification(config, num_labels)
    logits = model(input_ids, token_type_ids, input_mask)
    ```
    """

    def __init__(self, num_labels=2):
        super(BertForSequenceClassification, self).__init__()
        self.num_labels = num_labels
        if par_cased == 1:
            self.bert = BertModel.from_pretrained('bert-base-cased')
        else:
            self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, num_labels)
        nn.init.xavier_normal_(self.classifier.weight)

    def forward(self, input_ids, token_type_ids=None, attention_mask=None, labels=None):
        _, pooled_output = self.bert(input_ids, token_type_ids, attention_mask, output_all_encoded_layers=False)
        pooled_output = self.dropout(pooled_output)
        logits = self.classifier(pooled_output)

        return logits

    def freeze_bert_encoder(self):
        for param in self.bert.parameters():
            param.requires_grad = False

    def unfreeze_bert_encoder(self):
        for param in self.bert.parameters():
            param.requires_grad = True


from pytorch_pretrained_bert import BertConfig

config = BertConfig(vocab_size_or_config_json_file=32000, hidden_size=768,
                    num_hidden_layers=12, num_attention_heads=12, intermediate_size=3072)

num_labels = 2
model = BertForSequenceClassification(num_labels)

# Convert inputs to PyTorch tensors
#tokens_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(zz)])

#logits = model(tokens_tensor)
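
The commented-out lines above hint at a single forward pass; a minimal sketch of what that looks like with an arbitrary test sentence (the untrained classifier head will of course give meaningless probabilities):

```python
# Tokenize one sentence, convert it to ids, and run it through the classifier
tokens = tokenizer.tokenize("this is a quick test sentence.")
tokens_tensor = torch.tensor([tokenizer.convert_tokens_to_ids(tokens)])

model.eval()
with torch.no_grad():
    logits = model(tokens_tensor)        # shape: [1, num_labels]
    probs = F.softmax(logits, dim=1)     # normalized REAL/FAKE probabilities
print(probs)
```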

2. Dataset Processing

2.1 Read the data and convert the labels to binary values

import pandas as pd

dat = pd.read_csv('/data/fake_or_real_news.csv')
dat.head()
dat = dat.drop(columns=['Unnamed: 0', 'title_vectors'])
for i in range(len(dat)):
    if dat.loc[i, 'label'] == "REAL":    # REAL maps to 0
        dat.loc[i, 'label'] = 0
    elif dat.loc[i, 'label'] == "FAKE":  # FAKE maps to 1
        dat.loc[i, 'label'] = 1
    if dat.loc[i, 'text'] == "":         # drop rows with empty text
        dat = dat.drop([i])
dat.head()

2.2 Combine the title and text

dat_plus = dat.copy()
dat_plus['title_text'] = dat['title'] + '. ' + dat['text']
dat_plus = dat_plus.drop(columns=['title', 'text'])

dat_plus['title_text']

2.3 Use regular expressions to drop non-sentences

import re

def cleanup(text):
    if par_cased == 0:  # convert to lower case when using the uncased model
        text = text.lower()
    text = re.sub(r'http[s]?://(?:[a-zA-Z]|[0-9]|[[email protected]&+]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)  # drop http[s]://*
    text = re.sub(u"\\{.*?}|\\[.*?]", '', text)   # drop {*} and [*]
    text = re.sub(u"\(\@.*?\s", '', text)         # drop something like (@EP_President
    text = re.sub(u"\@.*?\s", '', text)           # drop something like @EP_President
    text = re.sub(u"\#.*?\s", '', text)           # drop something like #EP_President (hashtags)
    text = re.sub(u"\© .*?\s", '', text)          # drop something like © EP_President
    text = re.sub(r'pic.tw(?:[a-zA-Z]|[0-9]|[[email protected]&+#]|[!*\(\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+', '', text)  # drop pic.twitter.com/*
    text = re.sub(u"\*\*", '', text)              # drop markers like **Want FOX News First in your inbox every day? Sign up here.**
    text = re.sub(u"|•|☮️|💚|🌍|😍|♦|☢", '', text)  # drop stray symbols such as • and emoji
    return(text)
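
A quick illustration of what `cleanup` removes (the sample string is made up for demonstration):

```python
sample = ("**Want FOX News First in your inbox every day? Sign up here.** "
          "Follow @EP_President #EP_President for updates.")
print(cleanup(sample))
# With the default settings the text is lower-cased and the @handle, #hashtag,
# and ** markers are stripped by the patterns above.
```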

2.4 Use EDA methods to augment the text

import nlpaug.augmenter.char as nac
import nlpaug.augmenter.word as naw
import nlpaug.flow as naf

from nlpaug.util import Action
import nltk
nltk.download('punkt')

if par_cased == 1:
    aug = naf.Sequential([
        naw.BertAug(action="substitute", aug_p=0.8, aug_n=20, model_path='bert-base-cased', tokenizer_path='bert-base-cased'),
        naw.BertAug(action="insert", aug_p=0.1)
    ])
else:
    aug = naf.Sequential([
        naw.BertAug(action="substitute", aug_p=0.8, aug_n=20, model_path='bert-base-uncased', tokenizer_path='bert-base-uncased'),
        naw.BertAug(action="insert", aug_p=0.1)
    ])

def aug_text(text):
    text = aug.augment(text)
    return(text)

from nltk.tokenize import sent_tokenize

def sentence_token_nltk(text):
    sent_tokenize_list = sent_tokenize(text)
    return sent_tokenize_list

def eda_text(text):
    if len(text) < 2:
        return(text)
    # split the text into sentences
    text = sentence_token_nltk(text)
    if len(text) <= 1:
        return(text)
    if len(text) == 2:
        for i in range(len(text)):
            if i == 0:
                tmp_text = text[i]
            else:
                tmp_text += text[i]
        return(tmp_text)
    # take the first 3 sentences
    for i in range(3):
        if i == 0:
            tmp_text = text[i]
        else:
            tmp_text += text[i]
    zz = tokenizer.tokenize(tmp_text)
    # only augment if the tokenized prefix is short enough
    if len(zz) <= 500:
        #print(len(zz))
        tmp_text = aug_text(tmp_text)
    # combine the first 3 sentences with the remaining sentences
    for j in range(len(text) - 3):
        tmp_text += text[j + 3]
    return(tmp_text)

if par_eda == 1:  # apply EDA augmentation when par_eda is true
    for i in range(len(dat_plus['title_text'])):
        if i % 6 == 1:
            #print(i)
            dat_plus['title_text'][i] = copy.deepcopy(eda_text(dat_plus['title_text'][i]))
            dat_plus['title_text'][i] = "".join(dat_plus['title_text'][i])

3. Google BERT

import torch.nn.functional as F

#F.softmax(logits, dim=1)

from sklearn.model_selection import train_test_split

if par_cleanup == 1:
    X = dat_plus['title_text'].apply(cleanup)
else:
    X = dat_plus['title_text']
y = dat_plus['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

X_train = X_train.values.tolist()
X_test = X_test.values.tolist()

y_train = pd.get_dummies(y_train).values.tolist()  # convert to one-hot encoding
y_test = pd.get_dummies(y_test).values.tolist()

max_seq_length = 256

class text_dataset(Dataset):
    def __init__(self, x_y_list, transform=None):
        self.x_y_list = x_y_list
        self.transform = transform

    def __getitem__(self, index):
        tokenized_title_text = tokenizer.tokenize(self.x_y_list[0][index])

        if len(tokenized_title_text) > max_seq_length:
            tokenized_title_text = tokenized_title_text[:max_seq_length]

        ids_title_text = tokenizer.convert_tokens_to_ids(tokenized_title_text)  # tokens -> input_ids

        padding = [0] * (max_seq_length - len(ids_title_text))

        ids_title_text += padding  # pad every example to max_seq_length

        assert len(ids_title_text) == max_seq_length

        #print(ids_title_text)
        ids_title_text = torch.tensor(ids_title_text)

        label = self.x_y_list[1][index]  # one-hot label
        list_of_labels = [torch.from_numpy(np.array(label))]

        return ids_title_text, list_of_labels[0]

    def __len__(self):
        return len(self.x_y_list[0])
3.1 Create the DataLoader dictionary

batch_size = 16  # number of examples per mini-batch

train_lists = [X_train, y_train]
test_lists = [X_test, y_test]

training_dataset = text_dataset(x_y_list=train_lists)

test_dataset = text_dataset(x_y_list=test_lists)

dataloaders_dict = {'train': torch.utils.data.DataLoader(training_dataset, batch_size=batch_size, shuffle=True, num_workers=0),
                    'val': torch.utils.data.DataLoader(test_dataset, batch_size=batch_size, shuffle=True, num_workers=0)
                    }
dataset_sizes = {'train': len(train_lists[0]),
                 'val': len(test_lists[0])}

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(device)
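
Pulling a single batch from the training DataLoader is a useful sanity check on the tensor shapes:

```python
inputs, labels = next(iter(dataloaders_dict['train']))
print(inputs.shape)   # torch.Size([16, 256]) - padded token-id sequences
print(labels.shape)   # torch.Size([16, 2])   - one-hot labels
```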

3.2 Define the training function

def train_model(model, criterion, optimizer, scheduler, num_epochs=25):
    since = time.time()
    print('starting')
    best_model_wts = copy.deepcopy(model.state_dict())
    best_loss = 100
    best_f1 = 0.978
    best_acc_test = 0.96
    best_acc_train = 0.96
    best_auc = 0.96
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 10)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                scheduler.step()
                model.train()  # Set model to training mode
            else:
                model.eval()   # Set model to evaluate mode

            running_loss = 0.0

            label_corrects = 0
            TP = 0
            TN = 0
            FN = 0
            FP = 0
            total_scores = []
            total_tar = []
            # Iterate over data.
            for inputs, label in dataloaders_dict[phase]:
                #inputs = inputs
                #print(len(inputs), type(inputs), inputs)
                #inputs = torch.from_numpy(np.array(inputs)).to(device)
                inputs = inputs.to(device)
                label = label.to(device)

                # zero the parameter gradients
                optimizer.zero_grad()

                # forward
                # track history only in train
                with torch.set_grad_enabled(phase == 'train'):
                    # acquire output
                    outputs = model(inputs)

                    outputs = F.softmax(outputs, dim=1)  # note: softmax is applied before the loss here

                    loss = criterion(outputs, torch.max(label.float(), 1)[1])
                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                # statistics
                running_loss += loss.item() * inputs.size(0)
                label_corrects += torch.sum(torch.max(outputs, 1)[1] == torch.max(label, 1)[1])  # torch.max returns the largest value in each row together with its column index
                pred_choice = torch.max(outputs, 1)[1]
                target = torch.max(label, 1)[1]
                scores = pred_choice.cpu().tolist()
                tar = target.cpu().tolist()
                total_scores = total_scores + scores
                total_tar = total_tar + tar

                tmp_tp = 0
                tmp_tn = 0
                tmp_fn = 0
                tmp_fp = 0
                if pred_choice.numel() != target.numel():
                    print("error")
                for i in range(pred_choice.numel()):
                    if pred_choice[i] == 1 and target[i] == 1:
                        tmp_tp = tmp_tp + 1
                    elif pred_choice[i] == 0 and target[i] == 0:
                        tmp_tn = tmp_tn + 1
                    elif pred_choice[i] == 0 and target[i] == 1:
                        tmp_fn = tmp_fn + 1
                    elif pred_choice[i] == 1 and target[i] == 0:
                        tmp_fp = tmp_fp + 1
                # TP    both predict and label are 1
                TP += tmp_tp
                # TN    both predict and label are 0
                TN += tmp_tn
                # FN    predict 0, label 1
                FN += tmp_fn
                # FP    predict 1, label 0
                FP += tmp_fp
            epoch_loss = running_loss / dataset_sizes[phase]
            p = TP / (TP + FP)
            r = TP / (TP + FN)
            F1 = 2 * r * p / (r + p)
            acc = (TP + TN) / (TP + TN + FP + FN)

            ### draw ROC curve
            tpr = TP / (TP + FN)
            fpr = FP / (FP + TN)
            tnr = TN / (FP + TN)

            total_scores = np.array(total_scores)
            total_tar = np.array(total_tar)
            fpr, tpr, thresholds = roc_curve(total_tar, total_scores)
            roc_auc = auc(fpr, tpr)
            plt.title('ROC')
            if roc_auc > best_auc:
                best_auc = roc_auc
            if epoch < num_epochs - 1:
                plt.plot(fpr, tpr, 'b', label='AUC = %0.4f' % roc_auc)
            if epoch == num_epochs - 1:
                plt.plot(fpr, tpr, color='darkorange', label='MAX AUC = %0.4f' % best_auc)
            plt.legend(loc='lower right')
            plt.plot([0, 1], [0, 1], 'r--')
            plt.ylabel('TPR')
            plt.xlabel('FPR')
            plt.show()

            #print('{} p: {:.4f} '.format(phase, p))
            #print('{} r: {:.4f} '.format(phase, r))
            print('{} F1: {:.4f} '.format(phase, F1))
            print('{} accuracy: {:.4f} '.format(phase, acc))

            if phase == 'val' and epoch_loss < best_loss:
                print('saving with loss of {}'.format(epoch_loss),
                      'improved over previous {}'.format(best_loss))
                best_loss = epoch_loss
                best_model_wts = copy.deepcopy(model.state_dict())
                #torch.save(model.state_dict(), '/content/drive/My Drive/Colab Notebooks/bert_model_test_loss.pth')
            if F1 > best_f1:
                best_f1 = F1
            if phase == 'val' and acc > best_acc_test:
                best_acc_test = acc
            if phase == 'train' and acc > best_acc_train:
                best_acc_train = acc
                #best_model_wts = copy.deepcopy(model.state_dict())
                #torch.save(model.state_dict(), '/content/drive/My Drive/Colab Notebooks/bert_model_test_f1.pth')
        print()

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(
        time_elapsed // 60, time_elapsed % 60))
    print("Parameter settings: ")
    print("cased: ", par_cased)
    print("cleanup: ", par_cleanup)
    print("eda: ", par_eda)
    print('Best train Acc: {:4f}'.format(float(best_acc_train)))
    print('Best test Acc: {:4f}'.format(float(best_acc_test)))
    print('Best f1 score: {:4f}'.format(float(best_f1)))
    # load best model weights
    model.load_state_dict(best_model_wts)
    return model
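
Note that `criterion`, `optimizer_ft`, and `exp_lr_scheduler` used in Section 4.2 are not defined in the code shown here. A plausible setup consistent with this training loop is sketched below; the loss matches the CrossEntropy mentioned in the class docstring, while the learning rate and decay schedule are assumptions, not necessarily the values used to obtain the reported scores:

```python
import torch.nn as nn
import torch.optim as optim
from torch.optim import lr_scheduler

criterion = nn.CrossEntropyLoss()                        # classification loss (assumed)
optimizer_ft = optim.Adam(model.parameters(), lr=2e-5)   # small LR typical for BERT fine-tuning (assumed)
exp_lr_scheduler = lr_scheduler.StepLR(optimizer_ft, step_size=3, gamma=0.1)  # step decay (assumed)
```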

4. Final output

4.1 Model details

print(model)
model.to(device)

4.2 F1 and other details

model_ft1 = train_model(model, criterion, optimizer_ft, exp_lr_scheduler, num_epochs=10)
