“Speed” is the first priority of data analysis in the age of big data


When I learned data analysis a long time ago, datasets typically had only 100 to 1,000 samples, because teachers needed to explain the data in detail. There were only a few parameters to calculate, too. Therefore, most statistical tools could handle such data within a reasonable time; even spreadsheets worked well. Now, however, data volumes are huge and there are more than 1,000 or 10,000 parameters to calculate. Analyzing the data becomes a problem because it takes too long to complete the analysis and obtain the results. This is the problem in the age of big data.

This is one of the biggest reasons why a new generation of machine learning tools and languages is appearing on the market. Facebook open-sourced its deep learning modules for Torch in January 2015, H2O 3.0 was released as open source in May 2015, and Google released TensorFlow as open source this month. Each of these tools describes itself as “very fast”.

 

Let us consider these latest tools. I think each of them puts a premium on calculation speed. Torch uses LuaJIT+C, H2O runs on Java behind the scenes, and TensorFlow’s core is written in C++. LuaJIT, Java and C++ are usually much faster than scripting languages such as Python or R, so the new generation of tools should indeed be faster when big data has to be analyzed.

Last week, I mentioned deep learning with R+H2O. Now let me check how fast H2O can run a model and complete the analysis. This time I use H2O Flow, an awesome GUI, shown below. The deep learning model runs on my MacBook Air 11 (1.4 GHz Intel Core i5, 4 GB memory, 121 GB HD) as usual. A summary of the data used is as follows:

  • Data: MNIST hand-written digits
  • Training set: 19,000 samples with 785 columns
  • Test set: 10,000 samples with 785 columns

Then I create a deep learning model with three hidden layers of 1024, 1024 and 2048 units respectively. You can see the settings in the red box here. It is a fairly complex model, as it has three hidden layers.

[Image: DL MNIST1 model]
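For readers who prefer code to the Flow GUI, here is a minimal sketch of the same data and model setup using H2O’s Python API. The file names, the label column name "C785" and the number of epochs are my assumptions (the post only shows the Flow screens), so adjust them to your own setup.

```python
import h2o
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

h2o.init(max_mem_size="4G")  # roughly matching the MacBook Air above

# Hypothetical CSV files: 784 pixel columns plus one label column ("C785").
train = h2o.import_file("mnist_train_19000.csv")
test = h2o.import_file("mnist_test_10000.csv")

y = "C785"                       # assumed name of the digit-label column
x = [c for c in train.columns if c != y]
train[y] = train[y].asfactor()   # treat the digits 0-9 as classes
test[y] = test[y].asfactor()

# Three hidden layers with 1024, 1024 and 2048 units, as in the Flow model.
# The number of epochs is not shown in the post; 10 is an arbitrary choice.
model = H2ODeepLearningEstimator(hidden=[1024, 1024, 2048], epochs=10)
model.train(x=x, y=y, training_frame=train, validation_frame=test)
```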

It took just 20 minutes to complete. Amazing! That is very fast, given that deep learning requires a huge number of calculations to build a model. If deep learning models can be developed within 30 minutes, we can try many models with different parameter settings to understand what the data mean and obtain insights from them.

[Image: DL MNIST1 time]
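As a sketch of what “trying many models with different parameter settings” could look like in code, H2O’s grid search trains one model per combination of hyper-parameters. The values below are arbitrary examples, not the settings used for the run above, and x, y, train and test are reused from the previous sketch.

```python
from h2o.grid.grid_search import H2OGridSearch

# Arbitrary example values, not the settings from the Flow run above.
hyper_params = {
    "hidden": [[1024, 1024, 2048], [512, 512, 512], [200, 200]],
    "activation": ["Rectifier", "RectifierWithDropout"],
}

grid = H2OGridSearch(
    model=H2ODeepLearningEstimator(epochs=10),
    hyper_params=hyper_params,
)
grid.train(x=x, y=y, training_frame=train, validation_frame=test)

# Rank the trained models by validation log loss.
print(grid.get_grid(sort_by="logloss", decreasing=False))
```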

I did not stop the run before the model fitted the data. The confusion matrices tell us that the error rate is 2.04% for the training data (red box) and 3.19% for the test data (blue box). That looks good in terms of data fitting, and it means that 20 minutes is enough to create a good model in this case.

[Image: DL MNIST1 cm]
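If the model is built through the Python API instead of Flow, the same kind of confusion matrices can be printed as sketched below, continuing the snippets above; the Error column of each printed table gives the per-class and overall error rates.

```python
# Training-set metrics (by default H2O deep learning scores on a sample
# of the training data rather than the full 19,000 rows).
train_perf = model.model_performance(train=True)
print(train_perf.confusion_matrix())

# Test-set metrics, analogous to the blue box above.
test_perf = model.model_performance(test_data=test)
print(test_perf.confusion_matrix())
```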

 

Now it is almost impossible to understand data just by looking at them carefully, because they are far too big for our eyes. Through analytic models, however, we can understand what the data mean. The faster analyses can be completed, the more insight can be obtained from the data. That is wonderful for all of us. Yes, we even have enough time to relax and enjoy coffee and cake after our analyses are completed!

 

 

Note: Toshifumi Kuga’s opinions and analyses are personal views and are intended to be for informational purposes and general interest only and should not be construed as individual investment advice or solicitation to buy, sell or hold any security or to adopt any investment strategy. The information in this article is rendered as at publication date and may change without notice, and it is not intended as a complete analysis of every material fact regarding any country, region, market or investment.

Data from third-party sources may have been used in the preparation of this material and I, the author of the article, have not independently verified or validated such data. I and TOSHI STATS.SDN.BHD. accept no liability whatsoever for any loss arising from the use of this information, and reliance upon the comments, opinions and analyses in the material is at the sole discretion of the user.

 

Do it yourself: program image recognition. It works!

Recently, Facebook, Pinterest and Instagram have become very popular. A lot of pictures and images are generated and shared by users, covering a great variety of subjects, from human faces to landscapes. In order to enhance their services, image recognition technology has been developed at an astonishing rate. With this technology, computers can understand what the objects in images are. Today, I would like to re-create simple image recognition just by following tutorials on the web.

Image recognition can be done with the state-of-the-art technique called “deep learning”, one of the latest developments in computing. It sounds so complicated that business people may not want to try it themselves. However, since frameworks for deep learning are provided as open source and good tutorials are available on the web, it is possible for business people to program simple image recognition themselves, even without expertise in computer science. Let me tell you about my experience of doing just that.

 

1. Choose a programming language

There are several programming frameworks for deep learning. I choose “Torch”, which is used by Facebook Artificial Intelligence Research and whose deep learning modules became open source at the beginning of this year. I think it is easy for beginners to learn.

 

2.  Find good tutorials on the theory

In order to understand the theory behind image recognition, I found excellent tutorials and lectures provided by the Computer Science Department of the University of Oxford 1. They are a good reference for understanding what deep learning is and what its applications are. Even though the theory is not always required for programming, I recommend watching the tutorials before programming in order to grasp the broad picture of image recognition.

 

3.  Let us program image recognition and see what the computer says

The program itself is provided by the tutorial 2. In the tutorial I use an image dataset with ten classes: ‘airplane’, ‘automobile’, ‘bird’, ‘cat’, ‘deer’, ‘dog’, ‘frog’, ‘horse’, ‘ship’ and ‘truck’. The computer should classify each image into one of these 10 classes. I just copy and paste the programs provided in the tutorial, which takes less than 10 minutes, then run the program and obtain the results. Below, I pick three of the results and see what the computer says. The name of the object above each image is the correct answer. The computer gives its answer as a probability for each class, so the sum of the 10 numbers below each image is close to 1.
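The tutorial’s code is written in Lua for Torch, but the way those 10 numbers arise is easy to illustrate in a few lines of Python: the network produces one score per class, and a softmax turns the scores into probabilities that sum to 1; the class with the highest probability is the computer’s guess. The scores below are made up purely for illustration.

```python
import numpy as np

classes = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]

# Made-up raw scores for a single image, one value per class.
scores = np.array([0.2, 0.1, 0.7, 0.9, 0.3, 1.2, 2.0, 0.4, 0.1, 0.2])

# Softmax: exponentiate and normalise so the 10 values sum to 1.
probs = np.exp(scores - scores.max())
probs /= probs.sum()

for name, p in zip(classes, probs):
    print(f"{name:>10s}: {p:.4f}")
print("prediction:", classes[int(np.argmax(probs))])  # highest probability wins
```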

[Image: Screenshot 2015-08-04 15.59.42]

In this result, the correct answer is “frog”. In the computer’s answer, frog has the highest probability, 0.4749…. So the computer made a good guess!

 

[Image: Screenshot 2015-08-04 15.58.41]

In this result, the correct answer is “cat”. In the computer’s answer, cat has the highest probability, 0.3508…. So the computer made a good guess!

 

[Image: Screenshot 2015-08-04 16.00.08]

In this result, the correct answer is “automobile”. In the computer’s answer, automobile has the highest probability, 0.3622…. So the computer made a good guess! Although this program is not perfect in terms of accuracy over the whole test set, it is a reasonable way to learn how to program image recognition.

 

You may not be a computer scientist. However, it is worth programming this image recognition yourself, because it enables you to understand how it works based on state-of-the-art deep learning. Once you do it, you no longer need to treat image recognition as a “black box”. That is beneficial for you in the age of the digital economy.

Yes, Torch and the tutorials are free; no fee is required. Why not try it as a hobby?

 

Source

1. Machine Learning: 2014-2015, Nando de Freitas, Computer Science Department, University of Oxford. https://www.cs.ox.ac.uk/people/nando.defreitas/machinelearning/

2. Deep Learning with Torch – A 60-minute blitz. https://github.com/soumith/cvpr2015/blob/master/Deep%20Learning%20with%20Torch.ipynb

 

 
