This course is the best for beginners in data analysis. It is free, too!


Last week, I started an online course about data analysis. It is "The Analytics Edge" on edX, one of the biggest MOOC platforms in the world (www.edx.org).  The course description says, "Through inspiring examples and stories, discover the power of data and use analytics to provide an edge to your career and your life."  I have now completed Units 1 and 2 of the nine units in the course, and I have found that it is the best course for beginners in data analysis among MOOCs. Let me tell you why.

 

1. There are a variety of data sets to analyze

When you start learning data analysis, the data itself is very important for keeping you motivated.  If you work in sales, sales data is the best material for learning data analysis, because you are professionally interested in sales.  If you work in the financial industry, financial data is the best for you.  This course uses a variety of data, from crime rates to automobile sales, so you are likely to find data you are interested in. That is critically important for beginners in data analysis.

 

2. This course focuses on how to use analytics tools, rather than the theory behind the analysis

Many data analysis courses take a long time to explain the theory behind the analysis.  That is necessary if you want to be a data scientist, because theory is needed to construct analytic methods yourself. However, most business managers do not want to be data scientists.  All business managers need is a way to analyze data to make better business decisions. For this purpose, this course is well balanced between theory and practice.  First, a short summary of the theory is provided, and then the lectures move on to practice. Most of the lectures focus on how to use R for data analysis. R is one of the most popular programming languages for data analysis, and it is free for everyone.  This approach enables beginners to use R for analyzing data step by step.

 

3. It covers the major analytic methods of data analysis

When you look at the schedule of the course, you will find many analytic methods, from linear regression to optimization.  This course covers the major methods that beginners must know.  If you do not have enough time to complete all the units, I recommend focusing on linear regression and logistic regression, because both methods are applicable to many cases in the real world.

 

 

I think it is worth watching just the videos in Units 1 and 2.  Interesting topics are used, especially for people who like baseball. If you do not have enough time to learn R programming, it is OK to skip it. The stories behind the analyses are very good and informative for beginners, so you may enjoy the story videos and skip the programming videos the first time. If you want to obtain a certificate from edX, you need to score at least 55% across the homework, the competition, and the final exam.  For beginners, it may be difficult to complete the whole course within the limited time (three months).  Do not worry: I think this course can be taken again later.  So the first time, please focus on Units 1 and 2, and then try the whole course a second time if you can. In addition, most edX courses, including this one, are free for anyone.  You can enjoy them anytime, anywhere, as long as you have internet access.  Why not try this course with me (www.toshistats.net)?


Last week I held a seminar in Tokyo. Everyone was eager to learn data analysis.


Last week I took a business trip to Tokyo and held a seminar about data analysis for businesspeople.  There were around ten participants, all of them young businesspeople rather than data scientists.  The title was "How to predict the price of wine". Based on data about temperatures and the amount of rain, the price of wine can be predicted using a linear regression model. I explained this analysis in my blog before.

 

The seminar lasted about an hour and a half. During the seminar I felt that every participant was very interested in data analysis, and I received a lot of questions about it.  I think they face problems with a lot of data on a daily basis. Unfortunately, it is not easy to analyze data so that better business decisions can be made based on the results. I hope I provided the participants with clues to solving their problems.

 

In my seminar, I focused on the linear regression model. There are three reasons why I chose this model.

1.  It is the simplest model in data analysis.

It uses the inner product of the parameters and the explanatory variables. This is very simple and easy to understand, yet it appears many times in statistical models and machine learning. Once participants are familiar with the inner product, they can apply it to more complex models, as the small sketch below shows.
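As a minimal sketch of this idea in R (the numbers here are made up for illustration), a linear prediction is just the inner product of a parameter vector and a feature vector:

# Hypothetical parameters: intercept b = 3 and slope a = 2
theta <- c(3, 2)

# Explanatory variables for one observation: a constant 1 for the
# intercept, followed by the value of x (here x = 4)
x <- c(1, 4)

# The prediction is the inner product of parameters and variables
y_hat <- sum(theta * x)   # equivalently: theta %*% x
print(y_hat)              # 3 + 2*4 = 11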

 

2.  The linear regression model can be the basis for learning more complex models.

Although the linear regression model is simple, it can be extended to more complex models. For example, the logistic regression model uses the same inner product as the linear regression model, so the structures of the two models are similar to each other. A neural network can be expressed as layers of logistic regression models, and a support vector machine also uses the inner product of parameters and explanatory variables.

 

3.  The method for obtaining parameters can be extended to more complex models.

Obtaining the parameters is the key to using statistical models effectively.  For the linear regression model, the least squares method is used.  This method can be extended to maximum likelihood estimation, which is used in the logistic regression model. The sketch below illustrates the parallel.
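As a minimal sketch in R (with made-up toy data), fitting both models uses the same formula interface; lm() obtains parameters by least squares, while glm() uses maximum likelihood estimation:

# Toy data, made up for illustration
x <- c(1, 2, 3, 4, 5, 6)
y_cont <- c(5.1, 6.9, 9.2, 10.8, 13.1, 15.0)  # continuous outcome
y_bin  <- c(0, 0, 1, 0, 1, 1)                 # binary outcome

# Linear regression: parameters obtained by least squares
fit_lm <- lm(y_cont ~ x)
print(coef(fit_lm))

# Logistic regression: parameters obtained by maximum likelihood;
# the same linear predictor is passed through the sigmoid function
fit_glm <- glm(y_bin ~ x, family = binomial)
print(coef(fit_glm))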

 

If you are a beginner in data analysis, I recommend learning the linear regression model as the first step. Once you understand the linear regression model, it enables you to understand more complex models. Anyway, let us start learning the linear regression model. Good luck!

What is a utility function in recommender systems?


Let us go back to recommender systems, which I did not mention last week.  Last month I found that customers' preferences and item features are the keys to providing recommendations, and I started developing the model used in recommender systems.  Now I think I should explain the initial problem setting of recommender systems.  This week I looked at "Mining Massive Datasets" on Coursera and found that its problem setting for recommender systems is simple and easy to understand, so I decided to follow it. If you are interested in more detail, I recommend looking at this course, an excellent MOOC on Coursera.

 

Let us introduce a utility function, which tells us how satisfied customers are with the items. The term "utility function" comes from microeconomics, so some of you may have learned it before.  I think it is good to use a utility function here because we can then use the methods of economics when we analyze the impact of recommender systems on our society going forward.  I hope more people who are not data scientists will become interested in recommender systems.

The utility function is expressed as follows

U: θ × x → R

where U is the utility function, θ the customers' preferences, x the item features, and R the ratings of the items for the customers.

This makes it simple and easy to understand what a utility function is, and I would like to use this definition going forward. The ratings may be discrete values such as one, two, three, and so on, or continuous numbers, depending on the recommender system.

When we look at simple models such as the linear regression model and the logistic regression model, the key ingredients are the explanatory variables (or features) and their weights (or parameters), represented as x and θ respectively.  The product θx shows us how much impact they have on the variable we want to predict. Therefore I would like to introduce θx as a critical part of my recommender engine.  "θx" means that each x is multiplied by its corresponding weight θ and all the products are summed up. This is critically important for recommender systems. Mathematically, θx is a product of vectors/matrices. It is simple but has strong power to provide recommendations effectively, as the small sketch below suggests. I would like to develop my recommender engine by using θx next week.
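As a minimal sketch in R (all numbers and feature names here are made up for illustration), a predicted rating can be computed as θx, the inner product of a customer's preferences and an item's features:

# Hypothetical item features, e.g. (casual, formal, colorful) on a 0-1 scale
x_shirt <- c(0.9, 0.1, 0.8)

# Hypothetical customer preferences: weights for the same three features
theta_customer <- c(2.0, 0.5, 1.5)

# Predicted rating of this shirt for this customer: each x is multiplied
# by its corresponding weight theta and the products are summed up
rating <- sum(theta_customer * x_shirt)
print(rating)  # 2.0*0.9 + 0.5*0.1 + 1.5*0.8 = 3.05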

 

Yes, we should consider, for example, what color of shirt maximizes our utility functions.  In the future, the utility function of every person might be stored in computers, and recommendations might be provided automatically in order to maximize our utility functions, so everyone may be satisfied with everyday life. What a wonderful world that would be!

How can we predict the price of wine by data analysis?


Before discussing models in detail, it is good to explain how models work in general, so that beginners in data analysis can understand them. I selected a famous study of wine quality and price by Orley Ashenfelter, a professor of economics at Princeton University. You can look at the details of the analysis on this site: "BORDEAUX WINE VINTAGE QUALITY AND THE WEATHER". I redid the calculations myself in order to explain how models work in data analysis.

 

1. Gathering data

The quality and price of wine are closely related to the quality of the grapes, so it is worth considering what factors affect grape quality.  For example, temperatures, quantities of rain, the skill of the farmers, and the quality of the vineyards may be candidate factors. Historical data on each factor for more than 40 years is needed in this analysis. In practice, it is often very important whether data is available over long periods. The data used in this analysis is available here, so you can do it yourself if you want to.

 

2.  Put data into models

Once the data is prepared, I input it into my model in R. This time, I use the linear regression model, which is one of the simplest models.  This model can be expressed as the product of explanatory variables and parameters. According to the web site, the explanatory variables are as follows:

       WRAIN      1  Winter (Oct.-March) Rain  ML                
       DEGREES    1  Average Temperature (Deg Cent.) April-Sept.   
       HRAIN      1  Harvest (August and Sept.) ML               
       TIME_SV    1  Time since Vintage (Years)

This is RStudio, a famous integrated development environment for R.  In the upper left pane of RStudio, I wrote the function calling R's linear regression "lm" and input the data from the web site into the model.

[Screenshot: RStudio with the linear regression code and data]
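If you would like to try this yourself, here is a minimal sketch of what that code might look like (the file name "wine.csv" and the response variable LPRICE2, the log relative price, are my assumptions based on the published data set; adjust them to your copy of the data):

# Read the Bordeaux wine data (hypothetical file name)
wine <- read.csv("wine.csv")

# Fit a linear regression of the (log) price on the weather variables
model <- lm(LPRICE2 ~ WRAIN + DEGREES + HRAIN + TIME_SV, data = wine)

# Show the estimated coefficients (parameters)
summary(model)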

 

3. Examine the outputs from models

In order to predict wine prices, the parameters must be obtained first.  There is no need to worry: R calculates them automatically. The result is as follows ("coefficients" means parameters here). You can see this result in the lower left pane of RStudio.

Coefficients:

(Intercept)       WRAIN     DEGREES       HRAIN     TIME_SV
 -12.145007    0.001167    0.616365   -0.003861    0.023850

Finally, we can predict wine prices. You can see the predictions in the lower right pane of RStudio.

[Plot: predicted wine prices vs. actual prices]
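If you would like to reproduce this comparison plot yourself, here is a minimal sketch in R, assuming the model fitted above (with the same assumed column names):

# Predictions for every vintage in the data set
pred <- predict(model, newdata = wine)

# Plot actual values (blue circles) and predictions (red squares)
plot(wine$LPRICE2, col = "blue", pch = 1,
     xlab = "Vintage (row index)", ylab = "Log relative price")
points(pred, col = "red", pch = 0)
legend("topright", legend = c("Actual", "Predicted"),
       col = c("blue", "red"), pch = c(1, 0))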

The plot compares the predicted wine prices with the actual prices.  The red squares show the predicted prices and the blue circles show the actual prices. These are relative prices against the average price in 1961, so the actual price in 1961 is 1.0.  The model seems to work well.  Of course, it may not work now, as it was built more than 20 years ago, but this research is good for learning how models work and how predictions are made. Once you understand the linear regression model, it enables you to understand other, more complex models with ease. I hope you enjoy predicting wine prices. OK, let us move on to recommender engines again next week!

 

Notice: TOSHI STATS SDN. BHD. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using materials, instructions, methods, algorithm or ideas contained herein, or acting or refraining from acting as a result of such use. TOSHI STATS SDN. BHD. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on TOSHI STATS SDN. BHD. and me to correct any errors or defects in the codes and the software.

Credit Risk Management and Machine Learning in future


Credit risk is always a hot topic in economic news.  Are US Treasury bonds safe? Does the Chinese banking system have problems?  Is the ratio of household debt to GDP too high in Malaysia?  And so on. When we think about credit risk, it is critically important to know how we measure it. So I would like to reconsider how we can apply machine learning to credit risk management in financial institutions.  When I was a credit risk manager in a Japanese consumer finance company 10 years ago, there were no cloud services and no social media. However, now that it is the era of big data, it is a good time to reconsider how we can manage credit risk by using the latest technologies, such as cloud services and machine learning. Let me make three points, as follows.

 

1. Statistical models

One of the key metrics in credit risk management is the probability of default (PD).  It is usually calculated from statistical models such as regression analysis.  Machine learning has algorithms for regression analysis, so it should be easy to implement machine learning in the credit risk management systems of financial institutions.  Once PD is calculated by machine learning, the figure can be used in the same way as in current practice.  Statistical models usually go through many versions after they are developed, but there is no need to worry, because version control is easy on a cloud system. A small sketch of such a model follows.
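As a minimal sketch in R (the data below is entirely made up for illustration; a real PD model would use far richer borrower data), a logistic regression estimates the probability of default from borrower attributes:

# Hypothetical borrower data: income (in units of USD 10,000),
# debt ratio, and whether the borrower defaulted (1) or not (0)
income     <- c(3, 5, 2, 8, 4, 7, 1, 6, 2, 9)
debt_ratio <- c(0.8, 0.4, 0.9, 0.2, 0.6, 0.3, 0.95, 0.5, 0.7, 0.1)
default    <- c(1, 0, 1, 0, 0, 1, 1, 0, 0, 0)

# Logistic regression: models the probability of default (PD)
pd_model <- glm(default ~ income + debt_ratio, family = binomial)

# Estimated PD for a new borrower (income 40,000 USD, debt ratio 0.5)
predict(pd_model, newdata = data.frame(income = 4, debt_ratio = 0.5),
        type = "response")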

 

2. Data

Data is very different from what it was 10 years ago.  Back then, I used only private, closed data to calculate the risk of each borrower.  Now that social media data and open public data are available for risk management, we need to consider how to use these kinds of data in practice. For this purpose, cloud services are good because they are scalable, and it is easy to expand the storage capacity whenever needed.

 

3.  Product development

Recently, product development has become fast and active in order to keep a competitive edge, as customers can choose the channels through which they contact financial institutions and are becoming more demanding.  That is why risk management in financial institutions should be flexible enough to update the product portfolio and adjust its methods in light of the characteristics of new products.  The combination of cloud services and machine learning enables us to develop risk models quickly enough to keep up with new products and changes in the business environment.

 

Unlike retail industries, financial industries are heavily regulated and required to audit their risk management systems periodically. Therefore, audit trails are also important when machine learning is applied to credit risk management.  I think the combination of cloud services and machine learning is a good, cost-effective way to enhance credit risk management in the long run. I would like to try this combination in credit risk management if I get the chance.

Challenge to Machine Learning


Machine learning is becoming famous and attractive as a way to analyze big data.  Its algorithms have a long history of development, going back to the 1950s.  However, machine learning has come into the spotlight among data scientists only recently, because the large amounts of data, computing resources, and data storage necessary for machine learning have become available at reasonable cost.  I would like to introduce a basic machine learning algorithm using the R language, which I recommended before.

1.  Problem sets

The observed data are x = [1, 2, 3] and y = [5, 7, 9].  I would like to find a and b, assuming the relationship can be expressed as y = ax + b.  Yes, it is obvious that a = 2 and b = 3; however, I want to obtain this solution by using an algorithm to calculate it.

 

2. Algorithm

This is my program of machine learning to find a and b.  I would like to focus on the iterative part of the program (the body of the for loop below).

First step: update the parameters.

Second step: calculate the updated value of the cost function using the updated parameters.

Third step: compare the updated value with the old value of the cost function, and stop the calculation if it has converged.

Otherwise, go back to the first step.

These three steps are generally used in machine learning algorithms, so it is useful to remember them.

 

ML <- function(le) {
  x <- matrix(c(1, 1, 1, 1, 2, 3), 3, 2)     # design matrix: a column of 1s (for b) and x = 1, 2, 3
  y <- matrix(c(5, 7, 9), 3, 1)              # observed outputs
  theta <- matrix(1, 2, 1)                   # initial parameters, named theta to avoid masking t()
  m <- length(y)
  h <- x %*% theta                           # initial predictions
  j <- 1 / (2 * m) * (t(h - y) %*% (h - y))  # initial value of the cost function

  for (i in seq(1, 1000)) {
    h <- x %*% theta
    # First step: update the parameters
    theta_new <- theta - le / m * t(x) %*% (h - y)
    # Second step: calculate the updated value of the cost function
    h_new <- x %*% theta_new
    j_new <- 1 / (2 * m) * (t(h_new - y) %*% (h_new - y))
    # Third step: stop if the change in the cost function is small enough
    if (abs(j_new - j) <= 10^(-8)) break
    theta <- theta_new
    j <- j_new
  }
  print(i)      # number of iterations until convergence
  print(theta)  # estimated parameters: theta[1,] = b, theta[2,] = a
}

 

3.  The result of the calculation

I use le = 0.1 as the learning rate.  Then I get the result of the calculation below.

[1] 521

[,1]
[1,] 2.997600
[2,] 2.001056

This means that the value of the cost function converged after 521 iterations, with a = 2.001056 and b = 2.997600.  These are very close to the true values a = 2 and b = 3, so the algorithm can be considered to have found the solution.
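As a quick sanity check (a small sketch using base R, not part of the original program), the built-in lm() function recovers the same line from this toy data:

x <- c(1, 2, 3)
y <- c(5, 7, 9)
coef(lm(y ~ x))   # least squares fit: intercept (b) = 3, slope (a) = 2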

 

This algorithm is one of the simplest, but it includes the fundamental structure that can be applied to other, more complex algorithms.  So I recommend implementing it yourself and becoming familiar with this kind of algorithm.  In short: 1. Update the parameters.  2. Calculate the updated value of the cost function.  3. Check whether the updated value has converged.  Yes, it is so simple!

TOSHI STATS SDN. BHD. and I do not accept any responsibility or liability for loss or damage occasioned to any person or property through using algorithms, instructions, methods or ideas contained herein, or acting or refraining from acting as a result of such use. TOSHI STATS SDN. BHD. and I expressly disclaim all implied warranties, including merchantability or fitness for any particular purpose. There will be no duty on TOSHI STATS SDN. BHD. and me to correct any errors or defects in the codes and the software.

Do you want to be an "Analytic savvy manager"?

Data is all around us and is increasing at an astonishing rate.  In such a business environment, what should business managers do?  I do not think every manager should have analytics skills at the same level as a data scientist, because that is almost impossible.  However, I do think every manager should communicate with data scientists and make better decisions by using the output of their data analysis.  Such a manager is sometimes called an "analytic savvy manager".  Let us consider what an "analytic savvy manager" should know.

 

1.   What kind of data is available to us?

Business managers should know what kind of data is available for their business analysis.  Some of it is free and some is not; some is inside companies or private and some is public; some is structured and some is not.  Note that the available data is increasing in both volume and variety.  Data is the starting point of analysis; however, data scientists may not know specific fields of business in detail. It is business managers who know what data is available to their businesses.  Recently, data-gathering services have provided us with a lot of data for free.  I recommend looking at "Quandl" to find public data. It is easy to use and provides a lot of public data for free.  Strong recommendation!

 

2.  What kind of analysis method can be applied?

Business managers do not need to memorize the formulas of each analysis method.  I recommend that business managers understand simple linear regression and logistic regression and get the big picture of how statistical models work. Once you are familiar with these two methods, you can understand other, more complex statistical models with ease, because the fundamental structures are not so different across methods.  Statistical models enable us to understand what big data means without loss of information.  In addition, I also recommend that business managers keep in touch with the progress of machine learning, especially deep learning.  This method performs very well and is expected to be used in many business fields, such as natural language processing.  It may change the landscape of business going forward.

 

3.  How can output from analysis be used to make better decisions?

This is critically important for making better decisions.  The output of data analysis should be aligned with the business needs behind the decisions.  Data scientists can guarantee that the numbers in the output are accurate in terms of calculation; however, they cannot guarantee that the output is relevant and useful for making better decisions.  Therefore, business managers should communicate with data scientists during the process of data analysis and make the output relevant to business decisions. That is the goal of data analysis.

 

I do not think the points above are difficult for business managers to understand, even if they do not have a quantitative analytics background.  If you become familiar with them, it will set you apart from others in the age of big data.

Do you want to be an "Analytic savvy manager"?
