When I learned data analysis a long time ago, data sets had between 100 and 1,000 samples, because teachers needed to explain the data in detail. There were only a few parameters to calculate, too. Most statistical tools could therefore handle these data within a reasonable time. Even spreadsheets worked well. Now, however, there are huge volumes of data, and there may be more than 1,000 or 10,000 parameters to calculate. Analysis becomes a problem because it takes too long to complete and obtain the results. This is the problem in the age of big data.
This is one of the biggest reasons why a new generation of machine learning tools and languages has appeared on the market. Facebook open-sourced its deep learning modules for Torch in January 2015. H2O 3.0 was released as open source in May 2015, and Google released TensorFlow as open source this month. Each of them describes itself as “very fast”.
Let us consider each of these tools. I think each one places importance on the speed of calculation. Torch uses LuaJIT and C, H2O uses Java behind the scenes, and TensorFlow uses C++. LuaJIT, Java and C++ are usually much faster than scripting languages such as Python or R. Therefore, the new generation tools should be faster when big data has to be analyzed.
Last week, I mentioned deep learning with R and H2O. Now let me check how fast H2O runs a model to complete the analysis. This time I use H2O FLOW, an awesome GUI, shown below. The deep learning model runs on my MacBook Air 11-inch (1.4 GHz Intel Core i5, 4GB memory, 121GB HD) as usual. A summary of the data used is as follows:
- Data: MNIST hand-written digits
- Training set : 19000 samples with 785 columns
- Test set : 10000 samples with 785 columns
Then I create a deep learning model with three hidden layers of 1024, 1024 and 2048 units respectively. You can see it in the red box here. It is a fairly complex model, as it has three hidden layers.
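To get a feeling for how much this model has to calculate, here is a rough sketch in Python that counts the weights and biases of a fully connected network with these hidden layers. It is only an estimate under two assumptions of mine: that the 785 columns are 784 pixel values plus one label column, and that the output layer has 10 units, one per digit. The function name `count_parameters` is just for illustration and is not part of H2O.

```python
def count_parameters(layer_sizes):
    """Count weights and biases of a fully connected network.

    layer_sizes lists the number of units per layer, from input to output.
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # weights between two adjacent layers
        total += n_out         # one bias per unit in the receiving layer
    return total


# Assumed layout: 784 inputs (785 columns minus the label), the three
# hidden layers from the post, and 10 outputs (digits 0 to 9).
layer_sizes = [784, 1024, 1024, 2048, 10]
n_params = count_parameters(layer_sizes)
print(n_params)  # 3973130
```

Under these assumptions, the model has nearly four million parameters to fit, which is far beyond what a spreadsheet could handle.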
It took just 20 minutes to complete. It is amazing! It is very fast, given that deep learning requires many calculations to develop a model. If deep learning models can be developed within 30 minutes, we can try many models with different parameter settings to understand what the data mean and obtain insight from them.
I did not stop the model before it had fitted the data. The confusion matrices tell us that the error rate is 2.04% for the training data (red box) and 3.19% for the test data (blue box). This looks good in terms of data fitting. It means that 20 minutes is enough to create a good model in this case.
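For readers who have not seen how an error rate is read off a confusion matrix, the sketch below shows the arithmetic in Python. The small 3-class matrix is hypothetical and made up for illustration only; it is not the MNIST result above. I also assume that rows are actual classes and columns are predicted classes, so correct predictions sit on the diagonal.

```python
def error_rate(confusion):
    """Return the share of samples that lie off the diagonal."""
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    return (total - correct) / total


# Hypothetical 3-class confusion matrix (rows: actual, columns: predicted).
toy_matrix = [
    [48, 1, 1],
    [2, 45, 3],
    [0, 2, 48],
]
rate = error_rate(toy_matrix)
print(f"{rate:.2%}")  # 6.00%
```

The same calculation on a 10 x 10 matrix of digits gives the kind of overall error rate reported for the training and test data.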
Now it is almost impossible to understand data just by looking at them carefully, because they are too big to take in with our eyes. Through analytic models, however, we can understand what the data mean. The faster analyses can be completed, the more insight can be obtained from the data. It is wonderful for all of us. Yes, we can have enough time to relax and enjoy coffee and cakes after our analyses are completed!
Note: Toshifumi Kuga’s opinions and analyses are personal views and are intended to be for informational purposes and general interest only and should not be construed as individual investment advice or solicitation to buy, sell or hold any security or to adopt any investment strategy. The information in this article is rendered as at publication date and may change without notice, and it is not intended as a complete analysis of every material fact regarding any country, region, market or investment.
Data from third-party sources may have been used in the preparation of this material and I, the author of the article, have not independently verified or validated such data. I and TOSHI STATS.SDN.BHD. accept no liability whatsoever for any loss arising from the use of this information, and reliance upon the comments, opinions and analyses in the material is at the sole discretion of the user.