MultiIndex: How to create a multi-index

Algorithm

The process of the multi-index construction is described in our paper. Here we provide the details of the implementation.

After the vocabularies are trained (see below), the index construction progress in two stages: assigning points to multi-index entries ("coarse quantization") and calculating information for reranking. Because one can use different reranking approaches for the same coarse quantization, the first stage of the algorithm saves the coarse quantizations for all points in the database to hard drive. These coarse quantizations are just the entry identifiers (e.g. codeword pairs). So if the file with coarse quantizations has already been produced there is no need to calculate them again (in this case, remove the flag --build_coarse from the command line parameters).

In the CPU, a multi-index consists of a long onedimensional array containing the compressed points aligned by entries (i.e. a group of points belonging to the same entry is stored contiguously) and a table containing the starting index in the array for every entry of the multi-index. The class MultiIndexer is thus a C++ template by the type of the record in this array. In this way, you can easy implement your own reranking approach by defining new structure NewRecordType and implementing function GetRecord<NewRecordType> for your structure.

For index contstruction you should provide coarse vocabularies for building the multi-index structure and fine vocabularies for calculating the reranking information (assuming that you are using the provided reranking procedure). We assume that these files are prepared outside this code (C++ is not the simplest way to create vocabularies, just for your reference we provide a MATLAB-script to create them below).

File formats

Our code uses the .bvecs and .fvecs file formats developed by INRIA LEAR and TEXMEX groups.

Coarse vocabularies
Our code assumes that coarse vocabularies are in the following format:
4 bytes(one int32) - number of items in each vocabulary (N)
4 bytes(one int32) - dimension of item (d)
4*N*d*M bytes(N*d*M floats) - vocabulary items one after another (M is the multiplicity of algorithm)

Matlab script to build coarse vocabularies

clear all;

all_data = bvecs_read('sift1M.bvecs');

all_data = single(all_data);
vocabSize = 4096;
% add implementation of K-means
vocab1 = your_kmeans(single(all_data(1:end/2,:)),vocabSize);
vocab2 = your_kmeans(single(all_data(end/2+1:end,:)),vocabSize);

file = fopen(['sift1M_double_4096_' num2str(vocabSize) '.dat'], 'w');
dim = size(vocab1, 1);
sz = size(vocab1, 2);
fwrite(file, dim, 'int32');
fwrite(file, sz, 'int32');
fwrite(file, vocab1, 'float');
fwrite(file, vocab2, 'float');
fclose(file);
save(['sift1M_double_4096_' num2str(vocabSize) '.mat'], 'vocab1', 'vocab2');

Fine vocabularies
Our code assumes that fine vocabularies are in the following format:
4 bytes(one int32) - number of vocabularies (m)
4 bytes(one int32) - number of items in each vocabulary (N)
4 bytes(one int32) - dimension of item (d)
4*N*d*m bytes(N*d*m floats) - vocabulary items one after another

Matlab script to build fine vocabularies (used VlFeat library)

clear all;
all_data = fvecs_read('sift1M.fvecs');

vocabSize = 4096;
load(['sift1M_double_' num2str(vocabSize) '.mat'], 'vocab1', 'vocab2');

vocab1 = int32(vocab1);
vocab2 = int32(vocab2);
i1 = vl_ikmeanspush(all_data(1:end/2,:), vocab1);
i2 = vl_ikmeanspush(all_data(end/2+1:end,:), vocab2);  
residual = single(all_data)- single([vocab1(:,i1); vocab2(:,i2)]);
bytes_per_point = 8;    

D = size(residual,1) / bytes_per_point;
residual_vocab = cell(bytes_per_point,1);
dist = cell(bytes_per_point,1);
for m = 1:bytes_per_point
    chunk = residual(D*m-D+1:D*m,:);
        % add implementation of K-means
    residual_vocab{m} = your_kmeans(chunk,256);
    dist{m} = vl_alldist2(residual_vocab{m});          
end

save(['sift1M_double_4096_8.mat'],'residual_vocab','dist');

file = fopen(['sift1M_double_4096_8.dat'], 'w');
vocabs_count = size(residual_vocab, 1);
each_vocab_count = size(residual_vocab{1}, 2);
each_vocab_dim = size(residual_vocab{1}, 1);
fwrite(file, vocabs_count, 'int32');
fwrite(file, each_vocab_count, 'int32');
fwrite(file, each_vocab_dim, 'int32');
for i = 1:vocabs_count
    for j = 1:each_vocab_count
        a = residual_vocab{i}(:,j);
        fwrite(file, a, 'float');
    end
end
fclose(file);

Indexing sample

To build an invertered index for a set of points you should run "indexer_launcher" application with some command line parameters.

--threads_count            - the number of threads to use for the multi-threaded index construction
--multiplicity             - the number of groups of dimensions the vectors will be split into. Equals 2 or 4 for the experiments in the paper.
--points_file              - the path to the file with the vector database (should be in .bvecs or .fvecs format)
--coarse_vocabs_file       - the path to the file with the coarse vocabularies (see the format description above)
--fine_vocabs_file         - the path to the file with fine vocabularies for reranking (see the format description above)
--input_point_type         - "BVEC" or "FVEC"
--points_count             - the number of points to index
--space_dim                - the space dimensionality (e.g. 128 for SIFTs)
--files_prefix             - the common prefix for storing the multi-index files (used to control runs with different parameters)
--coarse_quantization_file - the path to the file with coarse quantizations
--metainfo_file            - the path to the file with metainformation (deprecated, just write "fake.txt")
--use_residuals            - the reranking method flag. Specify it if you want to use residuals for reranking (Multi-D-ADC) and omit it if you want to use initial points (Multi-ADC)
--build_coarse             - specify this flag if you want to recompute coarse quantizations (otherwise, will use the previously computed, if available)

Windows users can try launch_indexer.bat script. It launches indexing of the ANN_SIFT1M dataset using the provided vocabularies. Unix users should just write a similar launch_indexer.sh script.