deepdish

Serving Up Chicago-Style Deep Learning.

Latest post:

The building blocks of Deep Learning 21 Nov 2015

© 2015. All rights reserved.

Creating an LMDB database in Python

LMDB is the database of choice when using Caffe with large datasets. This is a tutorial of how to create an LMDB database from Python. First, let's look at the pros and cons of using LMDB over HDF5.

Reasons to use HDF5:

Reasons to use LMDB:

LMDB from Python

You will need the Python package lmdb as well as Caffe's python package (make pycaffe in Caffe). LMDB provides key-value storage, where each <key, value> pair will be a sample in our dataset. The key will simply be a string version of an ID value, and the value will be a serialized version of the Datum class in Caffe (which are built using protobuf).

import numpy as np
import lmdb
import caffe

N = 1000

# Let's pretend this is interesting data
X = np.zeros((N, 3, 32, 32), dtype=np.uint8)
y = np.zeros(N, dtype=np.int64)

# We need to prepare the database for the size. We'll set it 10 times
# greater than what we theoretically need. There is little drawback to
# setting this too big. If you still run into problem after raising
# this, you might want to try saving fewer entries in a single
# transaction.
map_size = X.nbytes * 10

env = lmdb.open('mylmdb', map_size=map_size)

with env.begin(write=True) as txn:
    # txn is a Transaction object
    for i in range(N):
        datum = caffe.proto.caffe_pb2.Datum()
        datum.channels = X.shape[1]
        datum.height = X.shape[2]
        datum.width = X.shape[3]
        datum.data = X[i].tobytes()  # or .tostring() if numpy < 1.9
        datum.label = int(y[i])
        str_id = '{:08}'.format(i)

        # The encode is only essential in Python 3
        txn.put(str_id.encode('ascii'), datum.SerializeToString())

You can also open up and inspect an existing LMDB database from Python:

import numpy as np
import lmdb
import caffe

env = lmdb.open('mylmdb', readonly=True)
with env.begin() as txn:
    raw_datum = txn.get(b'00000000')

datum = caffe.proto.caffe_pb2.Datum()
datum.ParseFromString(raw_datum)

flat_x = np.fromstring(datum.data, dtype=np.uint8)
x = flat_x.reshape(datum.channels, datum.height, datum.width)
y = datum.label

Iterating <key, value> pairs is also easy:

with env.begin() as txn:
    cursor = txn.cursor()
    for key, value in cursor:
        print(key, value)