Inception modules: explained and implemented

If you’re up to date with the AI world, you know that Google recently released a model called Inception v3 with Tensorflow.

Figure 0. Inception v3 model comes with Tensorflow (image source)

You can use this model by retraining the last layer per your classification requirements.  If you’re interested in doing this, check out this speedy 5 min tutorial by Siraj Raval.  The namesake of Inception v3 is the Inception modules it uses, which are basically mini models inside the bigger model.  The same Inception architecture was used in the GoogLeNet model which was a state of the art image recognition net in 2014.

Figure 1. GoogLeNet architecture with nine inception modules (image source)

In this post we’ll actually go into how to program Inception modules using Tensorflow following the details described by the original paper, “Going Deep with Convolutions.” If you’re not already comfortable with convolutional networks, check out Chapter 9 of the Deep Learning book, because we’ll assume that you already have a good understanding of convolutional operations and that you’ve coded some simpler convnets in Tensorflow.

What are Inception modules?

As is often the case with technical creations, if we can understand the problem that led to the creation, we will more easily understand the inner workings of that creation.

Believe it or not, Inception modules partially got their name from this meme (image source)

Udacity’s Deep Learning Course  did a good job introducing the problem and the  main advantages of using Inception architecture, so I’ll try to restate them here.  The inspiration comes from the idea that you need to make a decision as to what type of convolution you want to make at each layer:  Do you want a 3×3? Or a 5×5?  And this can go on for a while.

Figure 2. Udacity video explaining the motivation behind Inception (image source)

So why not use all of them and let the model decide? You do this by doing each convolution in parallel and concatenating the resulting feature maps before going to the next layer.

Now let’s say the next layer is also an Inception module.  Then each of the convolution’s feature maps will be passes through the mixture of convolutions of the current layer. The idea is that you don’t need to know ahead of time if it was better to do, for example, a 3×3 then a 5×5.  Instead, just do all the convolutions and let the model pick what’s best.  Additionally, this architecture allows the model to recover both local feature via smaller convolutions and high abstracted features with larger convolutions.


Now that we get the basic idea, let’s look into the specific architecture that we’ll implement.  Figure 3 shows the architecture of a single inception module.

Figure 3. Inception module from “Going Deeper with Convolutions.” (image source)

Notice that we get the variety of convolutions that we want; specifically, we will be using 1×1, 3×3, and 5×5 convolutions along with a 3×3 max pooling.  If you’re wondering what the max pooling is doing there with all the other convolutions, we’ve got an answer: pooling is added to the Inception module for no other reason than, historically, good networks having pooling.  The larger convolutions are more computationally expensive, so the paper suggests first doing a 1×1 convolution  reducing the dimensionality of its feature map, passing the resulting feature map through a relu, and then doing the larger convolution (in this case, 5×5 or 3×3). The 1×1 convolution is key because it will be used to reduce the dimensionality of its feature map.  This is explained in detail in the next section.

Dimensionality reduction

This was the coolest part of the paper.  The authors say that you can use 1×1 convolutions to reduce the dimensionality of your input to large convolutions, thus keeping your computations reasonable.  To understand what they are talking about, let’s first see why we are in some computational trouble without the reductions.

Let’s say we use, we the authors call, the naive implementation of an Inception module.

Figure 4. Naive Inception module (image source)

Figure 4 shows an Inception module that’s similar to the one in Figure 3, but it doesn’t have the additional 1×1 convolutional layers before the large convolutions (3×3 and 5×5 convolutions are considered large).

Let’s examine the number of computations required of the first Inception module of GoogLeNet. The architecture for this model is tabulated in Figure 5.

Figure 5. GoogLeNet architecture with the previous layer dimensions and the Inception module of interest highlighted in green and red, respectively

We can tell that the net uses same padding for the convolutions inside the module, because the input and output are both 28×28.  Let’s just examine what the 5×5 convolution would be computationally if we didn’t do the dimensionality reduction.  Figure 6. pictorially shows these operations.

Figure 6. 5×5 convolutions inside the Inception module using the naive model

There would be

5^2(28)^2(192)(32)=120,422,400 \text{ operations.}

Wow. That’s a lot of computing! You can see why people might want to do something to bring this number down.

To do this, we will ditch the naive model shown in Figure 4 and use the model from Figure 3.  For our 5×5 convolution, we put the previous layer through a 1×1 convolution that outputs a 16 28×28 feature maps (we know there are 16 from the #5×5 reduce column in Figure 5), then we do the 5×5 convolutions on those feature maps which outputs 32 28×28 feature maps.

Figure 7. 1×1 convolutions serve as the dimensionality reducers that limit the number of expensive 5×5 convolutions that follow

In this case, there would be

[(1^2)(28^2)(192)(16)]+[(5^2)(28^2)(16)(32)]=2,408,448+10,035,200= 12,443,648\text{ operations.}

Although this is a still a pretty big number, we shrunk the number of computation from the naive model by a factor of ten.

We won’t run through the calculations for the 3×3 covolutions, but they follow the same process as the 5×5 convolutions.  Hopefully, this sections cleared up why the 1×1 convolutions are necessary before large convolutions!

That wraps up the specifics of the Inception modules and hopefully clears up any ambiguity.  Now time to code our model.

Tensorflow implementation

Coding this up in Tensorflow is more straightforward than elegant.  For data, we’ll use the classic MNIST dataset.  We didn’t use the typical crazily formatted version, but found this neat site that has the csv version.  The architecture of our model will include two Inception modules and one fully connected hidden layer before our output layer.  You’ll definitely want to use your GPU to run the code, or else it’ll will take hours to days to train.  If you don’t have a GPU, you can check to see if your model works by using just a couple hundred training steps.  Depending on your computer, you might get a resource exhaust error and will have to shrink some of the parameters to get the code to run; in fact, I wasn’t able to use the parameters described in the paper which is why mine are smaller. On the other hand, if your machine can handle more parameters, you’ll be able to make your network wider and/or deeper.

Imported libraries

We’ll use pandas and numpy to preprocess the data and Tensorflow for the neural net.

import pandas as pd
import numpy as np
import tensorflow as tf

Data and preprocessing

Use pandas to import the csv.

train_set = pd.read_csv('mnist_train.csv',header=None)
test_set = pd.read_csv('mnist_test.csv',header=None)

The images, represented by rows in the table, need to be separated from the labels. Additionally, The labels need to be one-hot encoded.

 #get labels in own array

#one hot encode the labels
train_lb=(np.arange(10) == train_lb[:,None]).astype(np.float32)
test_lb=(np.arange(10) == test_lb[:,None]).astype(np.float32)

#drop the labels column from training dataframe

#put in correct float32 array format

Next, reformat the data so its a 4D tensor.

#reformat the data so it's not flat
testX = testX.reshape(len(testX),28,28,1)

It’ll be good to have a validation set so we can monitor how training is going.

#get a validation set and remove it from the train set

I ran into memory issues when trying to test all my data at once, so I created this class that’ll batch data.

#need to batch the test data because running low on memory
class test_batchs:
    def __init__(self,data): = data
        self.batch_index = 0
    def nextBatch(self,batch_size):
        if (batch_size+self.batch_index) >[0]:
            print "batch sized is messed up"
        batch =[self.batch_index:(self.batch_index+batch_size),:,:,:]
        self.batch_index= self.batch_index+batch_size
        return batch

#set the test batchsize
test_batch_size = 100

Before we start building models, create a function that’ll compute the accuracy from predictions and corresponding labels.

#returns accuracy of model
def accuracy(target,predictions):
    return(100.0*np.sum(np.argmax(target,1) == np.argmax(predictions,1))/target.shape[0])


It’s good to save our model after training so we can continue training without starting from scratch. This specific command will only work if you’re running linux.

#use os to get our current working directory so we can save variable there
file_path = os.getcwd()+'/model.ckpt'

Set hyperparameters.

  • batch_size: training batch size
  • map1: number of feature maps output by each tower inside the first Inception module
  • map2: number of feature maps output by each tower inside the second Inception module
  • num_fc1: number of hidden nodes
  • num_fc2: number of output nodes
  • reduce1x1: number of feature maps output by each 1×1 convolution that precedes a large convolution
  • dropout: dropout rate for nodes in the hidden layer during training

batch_size = 50
map1 = 32
map2 = 64
num_fc1 = 700 #1028
num_fc2 = 10
reduce1x1 = 16

Time for the bulk of the work, which will require Tensorflow.   Once the graph is defined, create placeholders that hold the training data, training labels, validation data, and test data. Then create some helper functions which assist in defining tensors, 2D convolutions, and max pooling.  Next, use the helper functions and hyperparameters to create variables in both Inception modules.  Then, create another function that  takes data as input and passes it through the Inception modules and fully connected layers and outputs the logits.  Finally, define the loss to be cross-entropy, use Adam to optimize, and create ops for converting data to predictions, initializing variables, and saving all variables in the model.

graph = tf.Graph()
with graph.as_default():
    #train data and labels
    X = tf.placeholder(tf.float32,shape=(batch_size,28,28,1))
    y_ = tf.placeholder(tf.float32,shape=(batch_size,10))

    #validation data
    tf_valX = tf.placeholder(tf.float32,shape=(len(valX),28,28,1))

    #test data

    def createWeight(size,Name):
        return tf.Variable(tf.truncated_normal(size, stddev=0.1),

    def createBias(size,Name):
        return tf.Variable(tf.constant(0.1,shape=size),

    def conv2d_s1(x,W):
        return tf.nn.conv2d(x,W,strides=[1,1,1,1],padding='SAME')

    def max_pool_3x3_s1(x):
        return tf.nn.max_pool(x,ksize=[1,3,3,1],

    #Inception Module1
    #follows input
    W_conv1_1x1_1 = createWeight([1,1,1,map1],'W_conv1_1x1_1')
    b_conv1_1x1_1 = createWeight([map1],'b_conv1_1x1_1')

    #follows input
    W_conv1_1x1_2 = createWeight([1,1,1,reduce1x1],'W_conv1_1x1_2')
    b_conv1_1x1_2 = createWeight([reduce1x1],'b_conv1_1x1_2')

    #follows input
    W_conv1_1x1_3 = createWeight([1,1,1,reduce1x1],'W_conv1_1x1_3')
    b_conv1_1x1_3 = createWeight([reduce1x1],'b_conv1_1x1_3')

    #follows 1x1_2
    W_conv1_3x3 = createWeight([3,3,reduce1x1,map1],'W_conv1_3x3')
    b_conv1_3x3 = createWeight([map1],'b_conv1_3x3')

    #follows 1x1_3
    W_conv1_5x5 = createWeight([5,5,reduce1x1,map1],'W_conv1_5x5')
    b_conv1_5x5 = createBias([map1],'b_conv1_5x5')

    #follows max pooling
    W_conv1_1x1_4= createWeight([1,1,1,map1],'W_conv1_1x1_4')
    b_conv1_1x1_4= createWeight([map1],'b_conv1_1x1_4')

    #Inception Module2
    #follows inception1
    W_conv2_1x1_1 = createWeight([1,1,4*map1,map2],'W_conv2_1x1_1')
    b_conv2_1x1_1 = createWeight([map2],'b_conv2_1x1_1')

    #follows inception1
    W_conv2_1x1_2 = createWeight([1,1,4*map1,reduce1x1],'W_conv2_1x1_2')
    b_conv2_1x1_2 = createWeight([reduce1x1],'b_conv2_1x1_2')

    #follows inception1
    W_conv2_1x1_3 = createWeight([1,1,4*map1,reduce1x1],'W_conv2_1x1_3')
    b_conv2_1x1_3 = createWeight([reduce1x1],'b_conv2_1x1_3')

    #follows 1x1_2
    W_conv2_3x3 = createWeight([3,3,reduce1x1,map2],'W_conv2_3x3')
    b_conv2_3x3 = createWeight([map2],'b_conv2_3x3')

    #follows 1x1_3
    W_conv2_5x5 = createWeight([5,5,reduce1x1,map2],'W_conv2_5x5')
    b_conv2_5x5 = createBias([map2],'b_conv2_5x5')

    #follows max pooling
    W_conv2_1x1_4= createWeight([1,1,4*map1,map2],'W_conv2_1x1_4')
    b_conv2_1x1_4= createWeight([map2],'b_conv2_1x1_4')

    #Fully connected layers
    #since padding is same, the feature map with there will be 4 28*28*map2
    W_fc1 = createWeight([28*28*(4*map2),num_fc1],'W_fc1')
    b_fc1 = createBias([num_fc1],'b_fc1')

    W_fc2 = createWeight([num_fc1,num_fc2],'W_fc2')
    b_fc2 = createBias([num_fc2],'b_fc2')

    def model(x,train=True):
        #Inception Module 1
        conv1_1x1_1 = conv2d_s1(x,W_conv1_1x1_1)+b_conv1_1x1_1
        conv1_1x1_2 = tf.nn.relu(conv2d_s1(x,W_conv1_1x1_2)+b_conv1_1x1_2)
        conv1_1x1_3 = tf.nn.relu(conv2d_s1(x,W_conv1_1x1_3)+b_conv1_1x1_3)
        conv1_3x3 = conv2d_s1(conv1_1x1_2,W_conv1_3x3)+b_conv1_3x3
        conv1_5x5 = conv2d_s1(conv1_1x1_3,W_conv1_5x5)+b_conv1_5x5
        maxpool1 = max_pool_3x3_s1(x)
        conv1_1x1_4 = conv2d_s1(maxpool1,W_conv1_1x1_4)+b_conv1_1x1_4

        #concatenate all the feature maps and hit them with a relu
        inception1 = tf.nn.relu(tf.concat(3,[conv1_1x1_1,conv1_3x3,conv1_5x5,conv1_1x1_4]))

        #Inception Module 2
        conv2_1x1_1 = conv2d_s1(inception1,W_conv2_1x1_1)+b_conv2_1x1_1
        conv2_1x1_2 = tf.nn.relu(conv2d_s1(inception1,W_conv2_1x1_2)+b_conv2_1x1_2)
        conv2_1x1_3 = tf.nn.relu(conv2d_s1(inception1,W_conv2_1x1_3)+b_conv2_1x1_3)
        conv2_3x3 = conv2d_s1(conv2_1x1_2,W_conv2_3x3)+b_conv2_3x3
        conv2_5x5 = conv2d_s1(conv2_1x1_3,W_conv2_5x5)+b_conv2_5x5
        maxpool2 = max_pool_3x3_s1(inception1)
        conv2_1x1_4 = conv2d_s1(maxpool2,W_conv2_1x1_4)+b_conv2_1x1_4

        #concatenate all the feature maps and hit them with a relu
        inception2 = tf.nn.relu(tf.concat(3,[conv2_1x1_1,conv2_3x3,conv2_5x5,conv2_1x1_4]))

        #flatten features for fully connected layer
        inception2_flat = tf.reshape(inception2,[-1,28*28*4*map2])

        #Fully connected layers
        if train:
            h_fc1 =tf.nn.dropout(tf.nn.relu(tf.matmul(inception2_flat,W_fc1)+b_fc1),dropout)
            h_fc1 = tf.nn.relu(tf.matmul(inception2_flat,W_fc1)+b_fc1)

        return tf.matmul(h_fc1,W_fc2)+b_fc2

    loss = tf.reduce_mean(
    opt = tf.train.AdamOptimizer(1e-4).minimize(loss)

    predictions_val = tf.nn.softmax(model(tf_valX,train=False))
    predictions_test = tf.nn.softmax(model(tf_testX,train=False))

    #initialize variable
    init = tf.initialize_all_variables()

    #use to save variables so we can pick up later
    saver = tf.train.Saver()

To train the model, set the number of training steps, create a session, initialize variables, and run the optimizer op for each batch of training data.  You’ll want to see how your model is progressing, so run the op for getting your validation predictions every 100 steps.  When training is done, output the test data accuracy and save the model.  I also created a flag use_previous that allows you to load a model from the file_path to continue training.

num_steps = 20000
sess = tf.Session(graph=graph)

#initialize variables
print("Model initialized.")

#set use_previous=1 to use file_path model
#set use_previous=0 to start model from scratch
use_previous = 1

#use the previous model or don't and initialize variables
if use_previous:
    print("Model restored.")

for s in range(num_steps):
    offset = (s*batch_size) % (len(trainX)-batch_size)
    batch_x,batch_y = trainX[offset:(offset+batch_size),:],train_lb[offset:(offset+batch_size),:]
    feed_dict={X : batch_x, y_ : batch_y}
    _,loss_value =[opt,loss],feed_dict=feed_dict)
    if s%100 == 0:
        feed_dict = {tf_valX : valX},feed_dict=feed_dict)

        print "step: "+str(s)
        print "validation accuracy: "+str(accuracy(val_lb,preds))
        print " "

    #get test accuracy and save model
    if s == (num_steps-1):
        #create an array to store the outputs for the test
        result = np.array([]).reshape(0,10)

        #use the batches class

        for i in range(len(testX)/test_batch_size):
            feed_dict = {tf_testX : batch_testX.nextBatch(test_batch_size)}
  , feed_dict=feed_dict)

        print "test accuracy: "+str(accuracy(test_lb,result))

        save_path =,file_path)
        print("Model saved.")

It should take around one hour to train on a 1060 GPU, and you should have a test accuracy of 98.5%.  The accuracy is a little disappointing, because simpler convnets do just as well and take much less time to train, like  this example from Tensorflow; but MNIST is infamous for being too easy these days, and these modules earned their reputation from a much more difficult classification task (around 1,000 different labels).  I hope this post helped explain the magic behind Inception modules and/or that the code aided in your understanding.  The full notebook with all the code can be found on my  GitHub.

25 thoughts on “Inception modules: explained and implemented

  1. Hi, in the paper, the author mentioned about approximating an ‘optimal local sparse structure’ of a convolutional vision network using readily available ‘dense components’.
    In the above, what do ‘optimal local sparse structure’ and ‘dense components’ refers to, and how to they relate?

    Liked by 1 person

    1. Hi Leonard. This was the topic I was most confused about when reading the paper, but luckily the authors did a lot of explaining in section 3. The paper they referenced when talking about the motivation behind Inception Modules was about generative neural nets. Those authors found that their edges in their neural net ended up being very sparse, thus they believe that you can create a deep neural net that is very sparse and still be able to represent visual concepts. This would mean that most of the weights inside a deep classification neural net would be close to zero. So how do you do this? Well the authors came up with the inception module, which is very dense (you have 5×5, 3×3, and 1×1 convolutions happening at every layer of the network), but is able to recover the sparse structure of images after training by making most of the parameters near zero.

      If that explanation didn’t help, this is what I thought of the first time a read the paper: There are relatively few key features in most images that play a discriminatory role in deciding the class of the photo (most images have a lot of pixels that are not important; just think if you’d be able to recognize a dog if the picture had a green background or a blue background…). For example, we can identify with a high level of confidence the location of a face in an image just by initially looking for edges of the face and then within the boundaries of those edges looking for shadows (also edges) caused by a nose or an eye; this is what the Haar classifier does. Just image how an inception module could do just this. It could first find the the location of the edges with a 5×5 or 3×3 convolution, then it could check for a nose and two eyes in the next layer with the 3×3 and/or 1×1 convolutions, and finally it could give an output.

      Liked by 1 person

      1. I found the ‘optimal local sparse structure’ topic confusing too. Is it possible to extract a optimal local sparse structure nearer the top level of the architecture stack before the consolidation into a dense vector?


      2. Hi Henry. Would you elaborate on your question? Each Inception Module has a concatenation before being passed to the next layer, and the extraction I think is happening in the modules and between the modules.


      3. I’ll re-phrase the question. In VGG16/19, before the fully-connected layers, the ‘block5_pool’ is sparse. Is there an equivalent layer in InceptionV3 and if not, is there a way to create one with fine-tuning by adding a sparse generating layer and retraining just the last few layers?


      4. That’s a good question! From looking at the architecture, it doesn’t look like there’s anything special at the end of Inception V3 that would make it sparse. I haven’t checked out VGG16/19, but is there anything before ‘block5_pool’ that’s forcing the structure to be sparse?


  2. Here are the last few VGG19 layers:

    block4_pool Tensor(“MaxPool_3:0”, shape=(?, 14, 14, 512), dtype=float32)
    block5_conv1 Tensor(“Relu_10:0”, shape=(?, 14, 14, 512), dtype=float32)
    block5_conv2 Tensor(“Relu_11:0”, shape=(?, 14, 14, 512), dtype=float32)
    block5_conv3 Tensor(“Relu_12:0”, shape=(?, 14, 14, 512), dtype=float32)
    block5_pool Tensor(“MaxPool_4:0”, shape=(?, 7, 7, 512), dtype=float32)
    flatten Tensor(“Reshape_13:0”, shape=(?, ?), dtype=float32)
    fc1 Tensor(“Relu_13:0”, shape=(?, 4096), dtype=float32)
    fc2 Tensor(“Relu_14:0”, shape=(?, 4096), dtype=float32)
    predictions Tensor(“Softmax:0”, shape=(?, 1000), dtype=float32)

    I guess the ReLu activations are generating the sparsity.


    1. If you look at Figure 0 you’ll see the architecture for Inception V3. You can try to mimic what you pasted above in your own implementation of V3. I’m not sure how you’d use the pretrained version of V3, but if you study the slim library in TensorFlow you should be able to figure it out. Best of luck!


  3. That was great explanation. Thank you.
    I am a bit confused when you compute the number of operations for Inception(3a) . Why do you use 192 ?
    5*5*28*28*192*32 = 120,422,400
    I would instead do: 5*5*28*28*32 = 627,200
    since 192 feature maps will be summed up to 1*28*28 before convolved by 32*5*5 kernels.

    Thank you,


  4. if we apply 1×1 convolution on 192@28*28, how does that outputs a 16 28×28 feature maps and on 16*28*28, if we do the 5×5 convolutions on those feature maps how does that outputs 32 28×28 feature maps. Can you please explain this ?


  5. Hi, thank you for your tutorial. it is well explained. When I run you notebook, i get the following error. Can you please help me? I have already search on the net but I can’t find a solution. Thanks

    ValueErrorTraceback (most recent call last)
    in ()
    139 loss = tf.reduce_mean(
    –> 140 tf.nn.softmax_cross_entropy_with_logits(model(X),y_))
    141 opt = tf.train.AdamOptimizer(1e-4).minimize(loss)

    in model(x, train)
    111 #concatenate all the feature maps and hit them with a relu
    –> 112 inception1 = tf.nn.relu(tf.concat(3,[conv1_1x1_1,conv1_3x3,conv1_5x5,conv1_1x1_4]))

    /usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.pyc in concat(values, axis, name)
    1120 axis, name=”concat_dim”,
    1121 dtype=dtypes.int32).get_shape().assert_is_compatible_with(
    -> 1122 tensor_shape.scalar())
    1123 return identity(values[0], name=scope)
    1124 return gen_array_ops.concat_v2(values=values, axis=axis, name=name)

    /usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/tensor_shape.pyc in assert_is_compatible_with(self, other)
    846 “””
    847 if not self.is_compatible_with(other):
    –> 848 raise ValueError(“Shapes %s and %s are incompatible” % (self, other))
    850 def most_specific_compatible_shape(self, other):

    ValueError: Shapes (4, 50, 28, 28, 32) and () are incompatible


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s