Model Building in Vowpal Wabbit

Vowpal Wabbit (github) is well known for its many neat tricks and blazing speed. It supports several loss functions: squared, logistic (including multiclass), hinge, and quantile. In addition, it can do sequence prediction, LDA, and active learning.

To install, download the latest version from GitHub and build it (otherwise I was having problems with the “invert_hash” option):

git clone git://github.com/JohnLangford/vowpal_wabbit.git
cd vowpal_wabbit/ 
make
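
Once the build finishes, a quick sanity check is to print the version (the rest of this post assumes the vw binary is invocable as vw):

vw --version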

There are quite a few options to consider for each task; refer to this page for the command line arguments.

Regression
I started with a regression and a multi-class classification problem. The regression is based on the Boston housing data (UCI Repository). The raw data need to be converted into VW format, which is

label | f1:v1 f2:v2 ... fm:vm

The following code converts the UCI data to VW format. Note: I’ve extracted the column names a priori.

raw_file = 'housing.data'
col_names = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
out_file = 'housing_vw.data'  # input file for VW
fw = open(out_file, 'w')
with open(raw_file, 'r') as finp:
    for line in finp:
        e = [w for w in line.strip().split(' ') if len(w) > 0]
        out_str = e[-1] + ' | '  # the last column (MEDV) is the label
        for (feature, name) in zip(e, col_names):  # zip stops after the 13 covariates
            if float(feature) != 0:  # VW treats absent features as zero
                out_str += name + ':' + feature + ' '
        fw.write(out_str + '\n')
fw.close()

The input to VW (top 10 lines) will look like

24.00 | CRIM:0.00632 ZN:18.00 INDUS:2.310 NOX:0.5380 RM:6.5750 AGE:65.20 DIS:4.0900 RAD:1 TAX:296.0 PTRATIO:15.30 B:396.90 LSTAT:4.98 
21.60 | CRIM:0.02731 INDUS:7.070 NOX:0.4690 RM:6.4210 AGE:78.90 DIS:4.9671 RAD:2 TAX:242.0 PTRATIO:17.80 B:396.90 LSTAT:9.14 
34.70 | CRIM:0.02729 INDUS:7.070 NOX:0.4690 RM:7.1850 AGE:61.10 DIS:4.9671 RAD:2 TAX:242.0 PTRATIO:17.80 B:392.83 LSTAT:4.03 
33.40 | CRIM:0.03237 INDUS:2.180 NOX:0.4580 RM:6.9980 AGE:45.80 DIS:6.0622 RAD:3 TAX:222.0 PTRATIO:18.70 B:394.63 LSTAT:2.94 
36.20 | CRIM:0.06905 INDUS:2.180 NOX:0.4580 RM:7.1470 AGE:54.20 DIS:6.0622 RAD:3 TAX:222.0 PTRATIO:18.70 B:396.90 LSTAT:5.33 
28.70 | CRIM:0.02985 INDUS:2.180 NOX:0.4580 RM:6.4300 AGE:58.70 DIS:6.0622 RAD:3 TAX:222.0 PTRATIO:18.70 B:394.12 LSTAT:5.21 
22.90 | CRIM:0.08829 ZN:12.50 INDUS:7.870 NOX:0.5240 RM:6.0120 AGE:66.60 DIS:5.5605 RAD:5 TAX:311.0 PTRATIO:15.20 B:395.60 LSTAT:12.43 
27.10 | CRIM:0.14455 ZN:12.50 INDUS:7.870 NOX:0.5240 RM:6.1720 AGE:96.10 DIS:5.9505 RAD:5 TAX:311.0 PTRATIO:15.20 B:396.90 LSTAT:19.15 
16.50 | CRIM:0.21124 ZN:12.50 INDUS:7.870 NOX:0.5240 RM:5.6310 AGE:100.00 DIS:6.0821 RAD:5 TAX:311.0 PTRATIO:15.20 B:386.63 LSTAT:29.93 
18.90 | CRIM:0.17004 ZN:12.50 INDUS:7.870 NOX:0.5240 RM:6.0040 AGE:85.90 DIS:6.5921 RAD:5 TAX:311.0 PTRATIO:15.20 B:386.71 LSTAT:17.10 

The training command is vw housing_vw.data --readable_model housing.model. The housing.model file stores the regression model in a readable format, as shown below:

Version 8.0.0
Min label:0.000000
Max label:50.000000
bits:18
lda:0
0 ngram: 
0 skip: 
options:
:0
2580:0.332198
54950:0.023167
102153:0.541679
104042:0.020405
108300:0.003121
116060:2.742849
125597:0.061628
141890:0.011573
158346:0.007335
165794:2.783368
170288:0.030039
182658:0.234611
223085:0.115288
232476:0.095232

The last 14 lines contain the coefficients for the 13 covariates plus the intercept (Constant). The feature names are hashed into indices in a 2^18-entry table (per bits:18 in the header), e.g., RM = 2580, ZN = 54950. With the “--invert_hash” option (vw -d housing_vw.data --invert_hash housing.hash) the model file (housing.hash) contains the original covariate names (note that Constant also appears, with hash value 116060):

Version 8.0.0
Min label:0.000000
Max label:50.000000
bits:18
lda:0
0 ngram: 
0 skip: 
options:
:0
AGE:104042:0.020405
B:158346:0.007335
CHAS:102153:0.541679
CRIM:141890:0.011573
Constant:116060:2.742849
DIS:182658:0.234611
INDUS:125597:0.061628
LSTAT:170288:0.030039
NOX:165794:2.783368
PTRATIO:223085:0.115288
RAD:232476:0.095232
RM:2580:0.332198
TAX:108300:0.003121
ZN:54950:0.023167
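
The readable files above are only for inspection. To score new data we also need a binary model, saved with -f and loaded back with -i. A minimal sketch of that workflow, reusing the same file as a stand-in test set and using housing.bin as an illustrative file name:

vw -d housing_vw.data -f housing.bin --readable_model housing.model
vw -d housing_vw.data -t -i housing.bin -p housing.predict

Here -t disables learning during the second pass and -p writes one prediction per input line.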

Multi-class Classification
The classification is based on the forest type mapping data (UCI Repository). The dataset is converted into VW format by the following Python code:

import csv

class_labels = {'s': '4', 'h': '2', 'd': '1', 'o': '3'}
raw_file = 'forest_type_testing.csv'
out_file = 'forest_type_testing_vw.data'
fw = open(out_file, 'w')
with open(raw_file) as csvfile:
    reader = csv.DictReader(csvfile)
    for row in reader:
        out_str = class_labels[row['class'].strip()] + ' | '  # the class index is the label
        feature_names = [k for k in row.keys() if k != 'class']
        for name in feature_names:
            if float(row[name]) != 0:
                out_str += name + ':' + row[name].strip() + ' '
        fw.write(out_str + '\n')
fw.close()

where the classes “s”, “h”, “d” and “o” are mapped to integer class indices (starting with 1). There are separate training and test sets (the same script, with the file names changed, produces forest_type_training_vw.data); the first few lines of the training set after conversion to VW format look like:

1 | b4:91 b5:59 b6:101 b7:93 b1:39 b2:36 b3:57 b8:27 b9:60 pred_minus_obs_H_b4:7.97 pred_minus_obs_H_b5:-32.92 pred_minus_obs_H_b6:-38.92 pred_minus_obs_H_b7:-14.94 pred_minus_obs_H_b1:75.7 pred_minus_obs_H_b2:14.86 pred_minus_obs_H_b3:40.35 pred_minus_obs_H_b8:4.47 pred_minus_obs_H_b9:-2.36 pred_minus_obs_S_b5:-1.6 pred_minus_obs_S_b4:-21.03 pred_minus_obs_S_b7:-22.5 pred_minus_obs_S_b6:-6.18 pred_minus_obs_S_b1:-18.41 pred_minus_obs_S_b3:-6.43 pred_minus_obs_S_b2:-1.88 pred_minus_obs_S_b9:-7.86 pred_minus_obs_S_b8:-5.2
2 | b4:112 b5:51 b6:98 b7:92 b1:84 b2:30 b3:57 b8:26 b9:62 pred_minus_obs_H_b4:-16.74 pred_minus_obs_H_b5:-24.92 pred_minus_obs_H_b6:-36.33 pred_minus_obs_H_b7:-15.67 pred_minus_obs_H_b1:30.58 pred_minus_obs_H_b2:20.42 pred_minus_obs_H_b3:39.83 pred_minus_obs_H_b8:8.16 pred_minus_obs_H_b9:-2.26 pred_minus_obs_S_b5:-1.99 pred_minus_obs_S_b4:-18.79 pred_minus_obs_S_b7:-23.41 pred_minus_obs_S_b6:-6.18 pred_minus_obs_S_b1:-16.27 pred_minus_obs_S_b3:-6.25 pred_minus_obs_S_b2:-1.95 pred_minus_obs_S_b9:-10.83 pred_minus_obs_S_b8:-8.87
4 | b4:99 b5:51 b6:93 b7:84 b1:53 b2:25 b3:49 b8:26 b9:58 pred_minus_obs_H_b4:3.25 pred_minus_obs_H_b5:-24.89 pred_minus_obs_H_b6:-30.38 pred_minus_obs_H_b7:-3.6 pred_minus_obs_H_b1:63.2 pred_minus_obs_H_b2:26.7 pred_minus_obs_H_b3:49.28 pred_minus_obs_H_b8:4.15 pred_minus_obs_H_b9:-1.46 pred_minus_obs_S_b5:-0.48 pred_minus_obs_S_b4:-17.73 pred_minus_obs_S_b7:-19.97 pred_minus_obs_S_b6:-4.69 pred_minus_obs_S_b1:-15.92 pred_minus_obs_S_b3:-4.64 pred_minus_obs_S_b2:-1.79 pred_minus_obs_S_b9:-7.07 pred_minus_obs_S_b8:-4.1
4 | b4:103 b5:47 b6:92 b7:82 b1:59 b2:26 b3:49 b8:25 b9:56 pred_minus_obs_H_b4:-6.2 pred_minus_obs_H_b5:-20.98 pred_minus_obs_H_b6:-30.28 pred_minus_obs_H_b7:-5.03 pred_minus_obs_H_b1:55.54 pred_minus_obs_H_b2:24.5 pred_minus_obs_H_b3:47.9 pred_minus_obs_H_b8:7.77 pred_minus_obs_H_b9:2.68 pred_minus_obs_S_b5:-2.34 pred_minus_obs_S_b4:-22.03 pred_minus_obs_S_b7:-27.1 pred_minus_obs_S_b6:-6.6 pred_minus_obs_S_b1:-13.77 pred_minus_obs_S_b3:-6.34 pred_minus_obs_S_b2:-2.53 pred_minus_obs_S_b9:-10.81 pred_minus_obs_S_b8:-7.99
1 | b4:103 b5:64 b6:106 b7:114 b1:57 b2:49 b3:66 b8:28 b9:59 pred_minus_obs_H_b4:-1.33 pred_minus_obs_H_b5:-37.99 pred_minus_obs_H_b6:-43.57 pred_minus_obs_H_b7:-34.25 pred_minus_obs_H_b1:59.44 pred_minus_obs_H_b2:2.62 pred_minus_obs_H_b3:32.02 pred_minus_obs_H_b8:1.83 pred_minus_obs_H_b9:-2.94 pred_minus_obs_S_b5:-0.85 pred_minus_obs_S_b4:-23.74 pred_minus_obs_S_b7:-22.83 pred_minus_obs_S_b6:-5.5 pred_minus_obs_S_b1:-21.74 pred_minus_obs_S_b3:-4.62 pred_minus_obs_S_b2:-1.64 pred_minus_obs_S_b9:-5.84 pred_minus_obs_S_b8:-2.74

To build a classifier using the one-against-all approach we run the following command:
vw -d forest_type_training_vw.data --oaa 4 -f forest_type.model --invert_hash forest_type.readable
Here -f saves the binary model that we will load for testing, and --invert_hash writes a human-readable version with the original feature names. The readable model (forest_type.readable) will look like

Version 8.0.0
Min label:-1.000000
Max label:1.000000
bits:18
lda:0
0 ngram:
0 skip:
options: --oaa 4
:0
Constant:202096:-0.092896
Constant[1]:202097:-0.010607
Constant[2]:202098:-0.122281
Constant[3]:202099:0.037829
b1:250508:-0.002837
b1[1]:250509:0.002688
b1[2]:250510:-0.000933
b1[3]:250511:-0.000484
b2:259424:0.000904
b2[1]:259425:-0.002864
b2[2]:259426:0.000600
b2[3]:259427:-0.003162
b3:194552:0.000042
b3[1]:194553:-0.000804
b3[2]:194554:-0.000013
b3[3]:194555:-0.001202
b4:206136:-0.000711

where the coefficients for the four per-class models are stored. To test the model on the test set we execute the following command:
vw -d forest_type_testing_vw.data -t -i forest_type.model -p forest_type.predict
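
VW reports the average loss for the test pass, but accuracy is easy to compute from the prediction file directly. A minimal sketch, assuming forest_type.predict contains one predicted class label per line, in the same order as the test file:

true_labels = []
with open('forest_type_testing_vw.data') as f:
    for line in f:
        true_labels.append(int(line.split('|')[0].strip()))  # the label precedes '|'

pred_labels = []
with open('forest_type.predict') as f:
    for line in f:
        pred_labels.append(int(float(line.split()[0])))  # predictions may be printed as floats

correct = sum(1 for t, p in zip(true_labels, pred_labels) if t == p)
print 'Accuracy: {0:.4f}'.format(float(correct) / len(true_labels))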

Topic Model
For topic modeling using LDA (Online Learning for LDA) I have taken the 20 Newsgroups data (UCI Repository). We need to do some cleaning and standard text processing before the data can be used for topic modeling. The Python code below creates the VW input for LDA.

import math
import nltk
import os
import re
import sys
import string
from collections import Counter
from nltk import tokenize
from nltk import word_tokenize
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer("english")
remove_words = ['the', 'and', 'is', 'are', 'a', 'an', 'nntppostinghost', 'maxaxaxaxaxaxaxaxaxaxaxaxaxaxax',
                'stephanopoulo', 'composmswindowsmisc', 'compsysmachardwar', 'sandviknewtonapplecom',
                'accessdigexnet', 'serazumauucp', 'composmswindowsapp',
                'that', 'for', 'you', 'this', 'not', 'have', 'with', 'but', 'they', 'from', 'can', 'what',
                'there', 'all', 'will', 'would', 'your', 'were', 'their', 'scispac', 'his', 'who', 'scsi'
                ]
pattern_1 = re.compile(".*edu")  # remove email ids
def remove_punctuation(s):
    for c in string.punctuation:
        s = s.replace(c, "")
    return s

def read_news_file(fname):
    sent = ''
    '''
    There are few lines in the beginning of each mail that we need to ignore
    '''
    initials = ['Newsgroups:', 'Path:', 'From:', 'Subject:', 'Message-ID:', 'Sender:', 'Organization:',
                'References:', 'Date:', 'Lines:']
    words = []
    with open(fname) as f:
        for line in f:
            w = line.strip().split(' ')
            if w[0] not in initials:
                sent += line
    sent = sent.lower()
    sent = remove_punctuation(sent)
    sent = sent.translate(None, string.digits)
    tokens = word_tokenize(sent)
    tokens_1 = [w for w in tokens if '@' not in w]  # get rid of emails
    tokens_2 = [nltk.stem.WordNetLemmatizer().lemmatize(w) for w in tokens_1]
    tokens_3 = [stemmer.stem(w.decode('ascii', 'ignore')) for w in tokens_2]
    words.extend(tokens_3)
    words1 = [w for w in words if len(w) > 2 and w not in remove_words]
    words = [w for w in words1 if pattern_1.match(w) == None]
    return words

if __name__ == "__main__":

    data_dir = '20_newsgroups'  # 128,606 unique words (>1), 127,965 (>2)
    max_words = 65000
    min_count = 5
    vocab_file = 'newsgroup_selected_words.dat'
    out_file = 'newsgroups_lda_vw.data'
    vocab = set()
    word_counter = Counter()
    for root, dirnames, filenames in os.walk(data_dir):
        for filename in filenames:
            full_file_path = root + '/' + filename
            words = read_news_file(full_file_path)
            vocab = vocab.union(set(words))
            word_counter.update(words)
        print len(vocab)
    print len(word_counter)

    top_words_and_counts = word_counter.most_common(max_words)
    top_words = [e[0] for e in top_words_and_counts if e[1] >= min_count]  # keep words occurring at least min_count times
    fv = open(vocab_file, 'w')
    count_word = 0
    for w in top_words:
        out_str = str(count_word) + '\t' + w + '\t' + str(word_counter[w]) + '\n'
        fv.write(out_str)
        count_word += 1
    fv.close()
    print 'Selected {0} words in the vocabulary'.format(len(top_words))

    fw = open(out_file, 'w')
    count_file = 0
    for root, dirnames, filenames in os.walk(data_dir):
        print root
        for filename in filenames:
            full_file_path = root + '/' + filename
            words = read_news_file(full_file_path)
            doc_counts = Counter(words)  # count each word once instead of calling words.count() repeatedly
            out_str = '| '
            for ii in range(len(top_words)):
                if doc_counts[top_words[ii]] > 0:
                    out_str += str(ii) + ':' + str(doc_counts[top_words[ii]]) + ' '
            fw.write(out_str + '\n')
            count_file += 1
    fw.close()
    print '{0} lines are written in {1}'.format(count_file, out_file)
    print '{0} lines are written in {1}'.format(count_file, out_file)

The code creates an input file (newsgroups_lda_vw.data) in the following format:

| 0:1 1:2 4:1 8:1 13:1 17:3 21:1 24:3 25:1 26:1 28:1 43:2 48:1 54:1 70:1 88:6 91:1 97:2 113:2 114:4 117:1 135:1 156:1 162:1 179:1 180:1 188:1 189:1 198:2 211:1 217:2 246:1 260:2 276:1 282:1 293:1 375:1 415:1 467:1 480:1 487:1 548:1 632:7 711:4 770:1 795:1 873:1 875:1 902:1 957:1 1014:1 1015:6 1077:1 1082:1 1212:1 1329:2 1446:2 1464:1 1544:1 1614:1 1863:1 1956:1 1979:1 2043:1 2365:1 2684:1 2771:2 2801:2 3095:1 3524:1 4365:1 4417:1 5548:1 7686:2 9272:1 9768:2 11010:2 17901:1 18109:1 19269:2 21364:1
| 74:1 114:1 138:1 167:1 176:1 192:1 982:1 1015:1 1586:1 2305:1 6518:1
| 0:1 4:1 11:1 42:1 50:1 64:1 69:1 122:1 163:2 339:1 356:1 477:1 537:1 548:1 702:1 752:1 1015:1 1055:1 1077:1 1308:1 1628:1 1728:1 2056:2 2131:1 2323:1 6173:1 12568:1 18663:1
| 9:1 18:1 21:1 25:1 50:1 72:1 128:1 188:1 271:1 275:1 370:1 477:1 521:1 1188:1 1446:1 1455:1 3838:1 11018:1 14451:1 16063:1
| 0:1 4:1 14:1 35:1 39:1 79:2 94:1 120:1 123:1 142:1 176:1 206:1 224:1 250:1 266:1 280:1 303:1 311:1 313:3 331:1 332:2 370:1 391:1 409:1 422:1 425:1 493:1 580:2 582:2 705:1 863:1 873:1 1800:1 1817:1 1843:1 2123:3 2553:1 2561:1 2784:1 3237:1 4127:1 5374:1 13885:1 14065:1 15883:1 20915:1
| 0:1 4:1 5:1 6:1 10:1 14:1 16:1 17:1 46:1 54:1 60:1 62:1 67:1 72:3 79:1 87:1 99:1 104:1 108:1 109:1 121:1 135:1 160:1 162:1 175:1 210:1 246:1 257:2 282:1 287:1 307:1 347:2 370:1 405:1 410:1 428:1 457:1 574:2 620:1 662:1 676:1 691:1 723:1 873:1 1027:1 1108:1 1118:1 1235:1 1296:1 1317:1 1393:1 1418:3 1446:1 1711:1 1814:2 2046:1 2238:1 2347:1 2500:1 2708:1 3566:1 4621:1 5789:1 6070:1 6942:1 8947:1 9004:1 9397:1 9852:1 11921:1 13227:1

where there is no label or namespace; the words are replaced by their vocabulary indices and the values are the corresponding counts. To run topic modelling we execute the following command:

vw -d newsgroups_lda_vw.data --lda 10 --lda_alpha 0.1 --lda_rho 0.1 --lda_D 19997 --minibatch 256 --power_t 0.5 --initial_t 1 -b 15 -c -k --passes 10 -p newsgroups_predictions.dat --readable_model topics.dat

Note: it is important to include -d before the name of the input file. The explanations of the other parameters are available here and here; briefly, --lda sets the number of topics, --lda_alpha and --lda_rho are the priors on the per-document topic weights and the per-topic word weights, --lda_D is the total number of documents, and -b 15 sizes the feature hash so that it can hold the vocabulary indices. The output of VW is written to two files: (1) newsgroups_predictions.dat, which contains the topic distribution for each document, and (2) topics.dat, which contains the topic-word distribution. The second file looks like

Version 8.0.0
Min label:0.000000
Max label:1.000000
bits:15
lda:10
0 ngram:
0 skip:
options: --lda 10
0 352.334259 285.166229 0.100307 0.698256 183.093353 636.476440 546.790833 87.499016 13479.592773 0.100183
1 0.100024 274.110382 0.100019 0.100028 244.759354 665.037476 14.111968 0.100022 14747.403320 10.416098
2 0.100017 3077.327393 0.100025 0.192351 0.100015 0.100035 0.100030 0.100036 7346.701172 4433.718750
3 0.100020 0.100499 0.100024 0.100102 109.228210 67.088684 11.771021 0.100021 14015.679688 0.103065
4 301.307709 170.543198 0.103659 7.471370 165.531708 531.659119 658.343811 0.100125 10310.763672 0.100030
5 0.100020 1213.295166 0.100018 0.100032 88.423347 3.309083 22.966171 0.100023 9224.307617 1166.300537

where from the ninth line onwards each row gives the per-topic weights of one word (rows correspond to the word indices used in the input, starting from 0). Now, for each topic we can extract the top N words. The Python code to do so is

# Get the indices of the n largest entries of x (ties handled correctly)
def top_indices(x, n):
    return sorted(range(len(x)), key=lambda i: x[i], reverse=True)[:n]

if __name__ == "__main__":

    vocab_file = 'newsgroup_selected_words.dat'
    vw_out_file = 'topics.dat'
    out_file = 'topic_words.dat'
    num_topics = 10
    num_top_words = 10

    vocab_words = []
    with open(vocab_file) as f:
        for line in f:
            w = line.strip().split('\t')
            vocab_words.append(w[1])

    topics = [[] for w in range(num_topics)]
    with open(vw_out_file) as f:
        for _ in xrange(8):
            next(f)
        for line in f:
            w = line.strip().split(' ')[1:]
            for t in range(num_topics):
                topics[t].append(float(w[t]))

    header = '\t'.join(['Topic-'+str(i) for i in range(num_topics)])
    fw = open(out_file, 'w')
    fw.write(header+'\n')
    topics_words = []
    for t in range(num_topics):
        top_words = [vocab_words[i] for i in top_indices(topics[t], num_top_words)]
        topics_words.append(top_words)
    for t in range(num_top_words):
        out_str = '\t'.join([topics_words[w][t] for w in range(num_topics)])
        fw.write(out_str + '\n')
    fw.close()

The output of this code in topic_words.dat for top-10 words looks like

Topic-0    Topic-1    Topic-2    Topic-3    Topic-4 Topic-5    Topic-6 Topic-7 Topic-8 Topic-9
car        drive      medic      space      game    god        govern  entri   one     file
bike       use        diseas     earth      year    christian  law     error   about   use
dod        key        patient    engin      team    peopl      peopl   int     write   window
ride       system     drug       solar      play    jesus      right   line    some    program
write      card       doctor     planet     player  armenian   state   author  articl  imag
road       chip       studi      spacecraft new     say        gun     name    like    system
duo        distribut  health     water      launch  church     public  return  dont    avail
front      comput     effect     oil        last    moral      secur   ripem   just    softwar
articl     new        treatment  probe      first   believ     countri char    out     version
brake      work       cancer     energi     win     exist      nation  string  when    which

Some of the topics are clear from their top words: Topic-0 is about motoring, Topic-2 is medical, Topic-3 is space related, Topic-4 is about sports/games, Topic-5 is religion, and Topic-6 is about governance.
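
The first output file, newsgroups_predictions.dat, can be used in the same spirit to tag each document with its dominant topic. A minimal sketch, assuming each line holds one (unnormalized) weight per topic for one example, and that with multiple passes the last 19,997 lines correspond to the final pass over the corpus:

num_docs = 19997
num_topics = 10

doc_topics = []
with open('newsgroups_predictions.dat') as f:
    for line in f:
        vals = line.strip().split(' ')
        doc_topics.append([float(v) for v in vals[:num_topics]])
doc_topics = doc_topics[-num_docs:]  # keep only the final pass

# highest-weight topic for each document
dominant = [t.index(max(t)) for t in doc_topics]
for k in range(num_topics):
    print 'Topic-{0}: {1} documents'.format(k, dominant.count(k))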

Sequence Learning
VW implements the SEARN structured prediction algorithm, which is used here for part-of-speech tagging. The dataset was downloaded from pos_data and is already in the required format. The first two sentences are “Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.” and “Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.”; each sentence is converted into one line per word, with sentences separated by a blank line, as shown below:

1  |w Pierre
1  |w Vinken
2  |w ,
3  |w 61
4  |w years
5  |w old
2  |w ,
6  |w will
7  |w join
8  |w the
9  |w board
10 |w as
8  |w a
5  |w nonexecutive
9  |w director
1  |w Nov.
3  |w 29
11 |w .

1  |w Mr.
1  |w Vinken
12 |w is
9  |w chairman
10 |w of
1  |w Elsevier
1  |w N.V.
2  |w ,
8  |w the
1  |w Dutch
13 |w publishing
9  |w group
11 |w .

There are a total of 38,219 sentences in the dataset. To build a POS tagger we execute vw -b 24 -k -c -d pos.gz --passes 4 --search_task sequence --search 45 --holdout_after 38219, where --search 45 gives the number of possible tags (labels).
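
If the data were not already in this format, the conversion is mechanical: assign an integer index to each POS tag and emit one “tag |w word” line per token, with a blank line between sentences. A minimal sketch, assuming a hypothetical input file raw_pos.txt with one sentence per line and word/TAG tokens:

tag_index = {}  # POS tag -> integer label, assigned on first sight (VW labels start at 1)
fw = open('pos.data', 'w')
with open('raw_pos.txt') as f:
    for sentence in f:
        for token in sentence.strip().split(' '):
            word, tag = token.rsplit('/', 1)
            if tag not in tag_index:
                tag_index[tag] = len(tag_index) + 1
            fw.write('{0} |w {1}\n'.format(tag_index[tag], word))
        fw.write('\n')  # blank line separates sentences
fw.close()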
