In this notebook, we demonstrate the main statistical plots, tests and distributions In Euler. See the following reference for more information about the available functions. ../reference/statist Let us start with some descriptive statistics. This is not an introduction to statistics. So you might need some background to understand the details. Assume the following measurements. We wish to compute the mean value and the measured standard deviation.
>M=[1000,1004,998,997,1002,1001,998,1004,998,997]; ... mean(M), dev(M),
999.9 2.72641400622
We can compute the 0,0.25,0.5,0.75,1 quantiles of M, and plot it into a plot showing the quantiles.
>boxplot(quantiles(M)); title("Quantiles at 0,0.25,0.5,0.75,1"):
We compute the probability that a value is bigger than 1005, assuming the measured values and a normal distribution. We print the result in % with 2 digits accuracy using the print function.
>print((1-normaldis(1005,mean(M),dev(M)))*100,2,unit="%")
3.07%
For the next example, we take the number of men in given size ranges.
>r=155.5:4:187.5; v=[22,71,136,169,139,71,32,8];
Here is a plot of the distribution.
>plot2d(r,v,a=150,b=200,c=0,d=190,bar=1,style="\/"):
We can put such raw data into a table. Tables are a method to store statistical data. Our table should contain three colums: Start of range, end of range, number of men in the range. Tables can be printed with headers. We use a vector of strings to set the headers.
>T:=r[1:8]' | r[2:9]' | v'; writetable(T,labc=["from","to","count"])
from to count 155.5 159.5 22 159.5 163.5 71 163.5 167.5 136 167.5 171.5 169 171.5 175.5 139 175.5 179.5 71 179.5 183.5 32 183.5 187.5 8
If we want the mean value and other statistics of the sizes, we need to compute the midpoint of the ranges. We could use the first two columns of our table for this. But it is easier, to fold the ranges with the vector [1/2,1/2].
>l=fold(r,[0.5,0.5])
[ 157.5 161.5 165.5 169.5 173.5 177.5 181.5 185.5 ]
Now we can compute the mean and deviation of the sample with the given frequencies.
>{m,d}=meandev(l,v); m, d,
169.901234568 5.98912964449
Let us add the normal distribution of the values to the plot.
>plot2d("qnormal(x,m,d)*sum(v)*4", ... xmin=min(r),xmax=max(r),thickness=3,add=1):
In the directory of this notebook there is a file with a table. The data represent the results of a survey. Here are the first four lines of the file. The data are from an German online book "Einführung in die Statistik mit R" by A. Handl.
>printfile("table.dat",4);
Person Sex Age Titanic Evaluation Tip Problem 1 m 30 n . 1.80 n 2 f 23 y g 1.80 n 3 f 26 y g 1.80 y
The table contains 7 columns of numbers or tokens (strings). We read the table first using our own translation for the tokens. First we define sets of tokens. The function strtokens() gets a string vector of tokens from a given string.
>mf:=["m","f"]; yn:=["y","n"]; ev:=strtokens("g vg m b vb");
Now we read the table with these translations. The arguments tok2, tok4 etc. are the translations of the columns of the table.
>{MT,hd}=readtable("table.dat",tok2=mf,tok4=yn,tok5=ev,tok7=yn);
For printing, we need to specify the same token sets. We print the first four lines only.
>writetable(MT[1:4],labc=hd,wc=5,tok2=mf,tok4=yn,tok5=ev,tok7=yn);
Person Sex Age Titanic Evaluation Tip Problem 1 m 30 n . 1.8 n 2 f 23 y g 1.8 n 3 f 26 y g 1.8 y 4 m 33 n . 2.8 n
The dots "." represent values, which are not available. If we do not want to specify the tokens for the translation in advance, we only need to specify, which columns contain tokens and not numbers.
>ctok=[2,4,5,7]; {MT,hd,tok}=readtable("table.dat",ctok=ctok);
The function readtable now returns a set of tokens.
>tok
m n f y g vg
The table contains the entries from the file with tokens translated to numbers. The special string NA="." is interpreted as "Not Available", and get NAN (not a number) in the table. This translation can be changed with the parameters NA, and NAval.
>MT[1]
[ 1 1 30 2 NAN 1.8 2 ]
Using the same token columns and the tokens read from the file, we can print the table.
>writetable(MT,labc=hd,ctok=ctok,tok=tok,wc=5);
Person Sex Age Titanic Evaluation Tip Problem 1 m 30 n . 1.8 n 2 f 23 y g 1.8 n 3 f 26 y g 1.8 y 4 m 33 n . 2.8 n 5 m 37 n . 1.8 n 6 m 28 y g 2.8 y 7 f 31 y vg 2.8 n 8 m 23 n . 0.8 n 9 f 24 y vg 1.8 y 10 m 26 n . 1.8 n 11 f 23 y vg 1.8 y 12 m 32 y g 1.8 n 13 m 29 y vg 1.8 y 14 f 25 y g 1.8 y 15 f 31 y g 0.8 n 16 m 26 y g 2.8 n 17 m 37 n . 3.8 n 18 m 38 y g . n 19 f 29 n . 3.8 n 20 f 28 y vg 1.8 n 21 f 28 y m 2.8 y 22 f 28 y vg 1.8 y 23 f 38 y g 2.8 n 24 f 27 y m 1.8 y 25 m 27 n . 2.8 y
Here is the content of the table with untranslated numbers.
>writetable(MT,wc=5)
1 1 30 2 . 1.8 2 2 3 23 4 5 1.8 2 3 3 26 4 5 1.8 4 4 1 33 2 . 2.8 2 5 1 37 2 . 1.8 2 6 1 28 4 5 2.8 4 7 3 31 4 6 2.8 2 8 1 23 2 . 0.8 2 9 3 24 4 6 1.8 4 10 1 26 2 . 1.8 2 11 3 23 4 6 1.8 4 12 1 32 4 5 1.8 2 13 1 29 4 6 1.8 4 14 3 25 4 5 1.8 4 15 3 31 4 5 0.8 2 16 1 26 4 5 2.8 2 17 1 37 2 . 3.8 2 18 1 38 4 5 . 2 19 3 29 2 . 3.8 2 20 3 28 4 6 1.8 2 21 3 28 4 1 2.8 4 22 3 28 4 6 1.8 4 23 3 38 4 5 2.8 2 24 3 27 4 1 1.8 4 25 1 27 2 . 2.8 4
The function tablecol returns the values of columns of the table, skipping any rows with NAN values("." in the file), and the indices of the columns, which contain these values.
>{c,i}=tablecol(MT,[5,6]);
We can use this to extract columns from the table for a new table.
>j=[1,5,6]; writetable(MT[i,j],labc=hd[j],ctok=[2],tok=tok)
Person Evaluation Tip 2 g 1.8 3 g 1.8 6 g 2.8 7 vg 2.8 9 vg 1.8 11 vg 1.8 12 g 1.8 13 vg 1.8 14 g 1.8 15 g 0.8 16 g 2.8 20 vg 1.8 21 m 2.8 22 vg 1.8 23 g 2.8 24 m 1.8
Of course, we can use it also to determine the mean value of a column or any other statistical value.
>mean(tablecol(MT,6))
2.175
The getstatistics function returns the elements in a vector, and their counts. We apply it to the "m" and "f" values in the second column of our table.
>{xu,count}=getstatistics(tablecol(MT,2)); xu, count,
[ 1 3 ] [ 12 13 ]
We can print the result in a new table.
>writetable(count',labr=tok[xu])
m 12 f 13
The function selecttable returns a new table with the values in one column selected from a vector of indices. First we look up the indices of two of our values in the token table.
>v:=indexof(tok,["g","vg"])
[ 5 6 ]
Now we can select the rows of the table, which have any of the values in v in their 5-th row.
>MT1:=MT[selectrows(MT,5,v)]; i:=sortedrows(MT1,5);
Now we can print the table, with extracted and sorted values in the 5-th column.
>writetable(MT1[i],labc=hd,ctok=ctok,tok=tok,wc=7);
Person Sex Age Titanic Evaluation Tip Problem 2 f 23 y g 1.8 n 3 f 26 y g 1.8 y 6 m 28 y g 2.8 y 18 m 38 y g . n 16 m 26 y g 2.8 n 15 f 31 y g 0.8 n 12 m 32 y g 1.8 n 23 f 38 y g 2.8 n 14 f 25 y g 1.8 y 9 f 24 y vg 1.8 y 7 f 31 y vg 2.8 n 20 f 28 y vg 1.8 n 22 f 28 y vg 1.8 y 13 m 29 y vg 1.8 y 11 f 23 y vg 1.8 y
For the next statistic, we want to relate two columns of the table. So we extract column 2 and 4 and sort the table.
>i=sortedrows(MT,[2,4]); ... writetable(tablecol(MT[i],[2,4])',ctok=[1,2],tok=tok)
m n m n m n m n m n m n m n m y m y m y m y m y f n f y f y f y f y f y f y f y f y f y f y f y f y
With getstatistics, we can also relate the counts in two columns of the table to each other.
>MT24=tablecol(MT,[2,4]); ... {xu1,xu2,count}=getstatistics(MT24[1],MT24[2]); ... writetable(count,labr=tok[xu1],labc=tok[xu2])
n y m 7 5 f 1 12
A table can be written to a file.
>filename=eulerhome()+"test.dat"; ... writetable(count,labr=tok[xu1],labc=tok[xu2],file=filename);
Then we can read the table from the file.
>{MT2,hd,tok2,hdr}=readtable(filename,>clabs,>rlabs); ... writetable(MT2,labr=hdr,labc=hd)
n y m 7 5 f 1 12
And delete the file.
>remove(filename);
With plot2d, there is a very easy method to plot a distribution.
>p=normal(1,1000); plot2d(p,distribution=20,style="\/"); ... plot2d("qnormal(x,0,1)",add=1):
Here is a comparison of 10 simulations of 1000 normal distributed values.
>p=normal(10,1000); >boxplot(quantiles(p)):
To generate random integers, we can use intrandom. Let us simulate dice throws and plot the distribution. We use the simple histo() function for this with integer limits. Note that that hist() function returns two arguments, which are both used by plot2d, since histo has the args flag.
>k=intrandom(1,6000,6); ... plot2d(histo(k,>integer),>bar); ... ygrid(1000,color=red):
Euler can produce random values from more distributions. Have a look into the Euler file. E.g., we try the exponential distribution.
>plot2d(randexponential(1,1000,2),>distribution):
For many distributions, Euler can compute the distribution function and the inverse.
>plot2d("normaldis",-4,4):
The following is one way to plot a quantile.
>plot2d("qnormal(x,1,1.5)",-4,6); ... plot2d("qnormal(x,1,1.5)",a=2,b=5,>add,>filled):
The probability to be in the green area is the following.
>normaldis(5,1,1.5)-normaldis(2,1,1.5)
0.248662156979
This can be computed numerically with the following integral.
>gauss("qnormal(x,1,1.5)",2,5)
0.248662156979
Let us compare the binomial distribution with the normal distribution of same mean and deviation.
>invbinsum(0.95,1000,0.5), invnormaldis(0.95,500,0.5*sqrt(1000))
526 526.007419394
The function qdis is the density of the chi-square distribution. As usual Euler maps vectors to this function. Thus we get a plot of all chi-square distributions with degrees 5 to 30 easily in the following way.
>plot2d("qchidis(x,(5:5:50)')",0,50):
Euler has a more accurate function to evaluate the chi-square distribution, using an integration. For practical purposes, the fast function chidis() is of course sufficient.
>chidis(1,2), xchidis(1,2), gauss("qchidis(x,2)",0,1)
0.39346927809 0.393469340287 0.393469340287
To plot data, we try the results of the German elections since 1990, measured in seats.
>BW := [ ... 1990,662,319,239,79,8,17; ... 1994,672,294,252,47,49,30; ... 1998,669,245,298,43,47,36; ... 2002,603,248,251,47,55,2; ... 2005,614,226,222,61,51,54; ... 2009,622,239,146,93,68,76];
For the parties, we use a string of names.
>P:=["CDU/CSU","SPD","FDP","Gr","Li"];
Let us print the percentages nicely. First we extract the necessary columns. Columns 3 to 7 are the seats of each party, and column 2 is the total number of seats. column is the year of the election.
>BT:=BW[,3:7]; BT:=BT/sum(BT); YT:=BW[,1]';
Then we print the statistics in table form. We use the names as column headers, and the years as headers for the rows. The default width for the columns is wc=10, but we prefer a denser output. The columns will be expanded for the labels of the columns, if necessary.
>writetable(BT*100,wc=6,dc=0,>fixed,labc=P,labr=YT)
CDU/CSU SPD FDP Gr Li 1990 48 36 12 1 3 1994 44 38 7 7 4 1998 37 45 6 7 5 2002 41 42 8 9 0 2005 37 36 10 8 9 2009 38 23 15 11 12
The following matrix multiplication extracts the sum of the percentages of the two big parties showing that the small parties have gained footage in the parliament.
>BT1:=(BT.[1;1;0;0;0])'*100
[ 84.2900302115 81.25 81.1659192825 82.7529021559 72.9641693811 61.8971061093 ]
There is also a simplified statistical plot. We use it to display lines and points simultaneously. The alternative is to call plot2d twice with >add.
>statplot(YT,BT1,"b"):
Define some colors for each party.
>CP:=[rgb(0.3,0.3,0.3),red,yellow,green,rgb(0.8,0,0)];
Now we can plot the results of the 2009 election and the changes into one plot using figure. We can add a vector of columns to each plot.
>figure(2,1); ... figure(1); columnsplot(BW[6,3:7],P,color=CP); ... figure(2); columnsplot(BW[6,3:7]-BW[5,3:7],P,color=CP); ... figure(0):
Data plots combine rows of statistical data in one plot.
>J:=BW[,1]'; DP:=BW[,3:7]'; ... dataplot(YT,BT',color=CP); ... labelbox(P,colors=CP,styles="[]",>points,w=0.2):
A 3D columns plot shows rows of statistical data in form of columns. We provide labels for the rows and the columns. angle is the viewing angle.
>columnsplot3d(BT,scols=P,srows=YT, ... angle=30°,ccols=CP):
Another representation is the mosaic plot. Note that the columns of the plot represent the columns of the matrix here.
>mosaicplot(BT',srows=YT,scols=P,color=CP,style="#"):
We can also do a pie chart. Since black and yellow form a coalition, we reorder the elements.
>i=[1,3,5,4,2]; piechart(BW[6,3:7][i],color=CP[i],lab=P[i]):
Here is another kind of plot.
>starplot(normal(1,10)+4,lab=1:10,>rays):
Some plots in plot2d are good for statics. Here is an impulse plot of random data, equidistributed in [0,1].
>plot2d(makeimpulse(1:10,random(1,10)),>bar):
But for exponentially distributed data, we may need a logplot.
>logimpulseplot(1:10,-log(random(1,10))*10):
The following is a way to plot the frequencies of numbers in a vector. We create a vector of integer random numbers 1 to 6.
>v:=intrandom(1,10,10)
[ 6 5 8 10 6 1 5 6 9 2 ]
Then extract the unique numbers in v.
>vu:=unique(v)
[ 1 2 5 6 8 9 10 ]
And plot the frequencies in a columns plot.
>columnsplot(getmultiplicities(vu,v),lab=vu,style="/"):
We want to use functions for the empirical distribution of values.
>x=normal(1,20);
The function empdist(x,vs) needs a sorted array of values. So we have to sort x before we can use it.
>xs=sort(x);
Then we plot the empirical distribution and some density bars into one plot. Instead of a bar plot for the distribution we use a sawtooth plot this time.
>figure(2,1); ... figure(1); plot2d("empdist",-4,4;xs); ... figure(2); plot2d(histo(x,v=-4:0.2:4,<bar)); ... figure(0):
A scatterplot is easy to do in Euler with the usual point plot. The following graph shows that the X and X+Y are clearly positively correlated.
>x=normal(1,100); plot2d(x,x+rotright(x),>points,style=".."):
For another example we read a survey of students, their ages, the ages of their parents and the number of siblings from a file. This table contains "m" and "f" in the second column. We use the variable tok2 to set the proper translations instead of letting readtable collect the translations.
>{MS,hd}:=readtable("table1.dat",tok2=["m","f"]); ... writetable(MS,labc=hd,tok2=["m","f"]);
Person Sex Age Mother Father Siblings 1 m 29 58 61 1 2 f 26 53 54 2 3 m 24 49 55 1 4 f 25 56 63 3 5 f 25 49 53 0 6 f 23 55 55 2 7 m 23 48 54 2 8 m 27 56 58 1 9 m 25 57 59 1 10 m 24 50 54 1 11 f 26 61 65 1 12 m 24 50 52 1 13 m 29 54 56 1 14 m 28 48 51 2 15 f 23 52 52 1 16 m 24 45 57 1 17 f 24 59 63 0 18 f 23 52 55 1 19 m 24 54 61 2 20 f 23 54 55 1
How do the ages depend on each other? A first impression comes from a pairwise scatterplot.
>scatterplots(tablecol(MS,3:5),hd[3:5]):
It is clear that the age of the father and mother depend on each other. Let us determine and plot the regression line.
>cs:=MS[,4:5]'; ps:=polyfit(cs[1],cs[2],1)
[ 17.3789156627 0.740963855422 ]
This is obviously the wrong model. The regression line would be s=17+0.74t, where t is the age of the mother and s the age of the father. The age difference may depend a little bit on the age, but not that much. We rather suspect something like s=a+t. Then a is the mean of the s-t. It is the average age difference between fathers and mothers.
>da:=mean(cs[2]-cs[1])
3.65
Let us plot this into one scatter plot.
>plot2d(cs[1],cs[2],>points); ... plot2d("evalpoly(x,ps)",color=red,style=".",>add); ... plot2d("x+da",color=blue,>add):
Here is a box plot of the two ages. This only shows, that the ages are different.
>boxplot(quantiles(cs),["mothers","fathers"]):
It is interesting that the difference in medians is not as large as the difference in means.
>median(cs[2])-median(cs[1])
1.5
The correlation coefficient suggests a positive correlation.
>correl(cs[1],cs[2])
0.7588307236
The correlation of the ranks is a measure for the same order in both vectors. It is also quite positive.
>rankcorrel(cs[1],cs[2])
0.758925292358
Euler can be used to simulate random events. We have already seen simple examples. Here is another one, which simulates 1000 times 3 dice throws, and asks for the distribution of the sums.
>ds:=sum(intrandom(1000,3,6))'; fs=getmultiplicities(3:18,ds)
[ 1 10 27 43 77 110 111 138 114 124 98 58 51 21 13 4 ]
We can plot this now.
>columnsplot(fs,lab=3:18);
To determine the true distribution is not so easy. We use an advanced recursion for this. The following function counts the number of ways the number k can be represented as the sum of n numbers in the range of 1 to m. It works recursively in an obvious way, once you understand how recursion works at all.
>function map countways (k; n, m) ...
if n==1 then return k>=1 && k<=m else sum=0; loop 1 to m; sum=sum+countways(k-#,n-1,m); end; return sum; end; endfunction
Here is the result for three throws of dices.
>cw=countways(3:18,3,6)
[ 1 3 6 10 15 21 25 27 27 25 21 15 10 6 3 1 ]
We add the expected values to the plot.
>plot2d(cw/6^3*1000,>add); plot2d(cw/6^3*1000,>points,>add):
For another simulation, the deviation of the mean value of n 0-1-normal distributed random variables is 1/sqrt(n).
>longformat; 1/sqrt(10)
0.316227766017
Let us check this with a simulation. We produce 10000 times 10 random vectors.
>M=normal(10000,10); dev(mean(M)')
0.314189042955
>plot2d(mean(M)',>distribution):
The median of 10 0-1-normal distributed random numbers has a larger deviation.
>dev(median(M)')
0.368432997022
Next we test dice throws for equidistribution. At 600 throws, we got the following values, which we insert into the chi-square test.
>chitest([90,103,114,101,103,89],dup(100,6)')
0.498830731183
So we do not reject equi-distribution. Next we generate 1000 dice throws using the random number generator, and do the same test.
>n=1000; t=random([1,n*6]); chitest(count(t*6,6),dup(n,6)')
0.311903808362
A result less than 0.05 leads to the assumption of a false dice. This may happen every 20-th time. Let us test for the mean value 100 with the t-test.
>s=100+normal([1,100])*10; >ttest(mean(s),dev(s),10,100)
0.478980787552
The function expects the mean, the deviation and the number of data, and the mean value to test for. Now let us check two measurements for the same mean. We reject if the result is <0.05.
>tcomparedata(normal(1,10),normal(1,10))
0.384604341208
If we add a bias to one distribution, we get more rejections. Repeat this test several times to see the effect.
>tcomparedata(normal(1,10),normal(1,10)+1)
0.0318703758486
In the next example, we generate 20 random dice throws 100 times and count the ones in it. There must be 20/6=3.3 ones on average.
>R=random(1000,20); R=sum(R*6<=1)'; mean(R)
3.296
We now compare the number of ones with the binomial distribution. First we plot the distribution of ones.
>plot2d(R,distribution=max(R)+1,even=1,style="\/"):
>t=count(R,21);
Then we compute the expected values.
>n=0:20; b=bin(20,n)*(1/6)^n*(5/6)^(20-n)*100;
We now collect several numbers to get categories, which are big enough.
>t1=sum(t[1:2])|t[3:7]|sum(t[8:21]); >b1=sum(b[1:2])|b[3:7]|sum(b[8:21]);
The chitest rejects our result, if its result is <0.05.
>chitest(t1,b1)
0
The following example contains results of two groups of persons (male and female, say) voting for one out of six parties.
>A=[23,37,43,52,64,74;27,39,41,49,63,76]; ... writetable(A,wc=6,labr=["m","f"],labc=1:6)
1 2 3 4 5 6 m 23 37 43 52 64 74 f 27 39 41 49 63 76
We wish to test for independence of the votes from the sex. The chi^2 table test does this. The result is way to large to reject independence. So we cannot say, if the voting depends on the sex from these data.
>tabletest(A)
0.990701635263
The following is the expected table, if we assume the observed frequencies of voting.
>writetable(expectedtable(A),wc=6,dc=1,labr=["m","f"],labc=1:6)
1 2 3 4 5 6 m 24.9 37.9 41.9 50.3 63.3 74.7 f 25.1 38.1 42.1 50.7 63.7 75.3
We can compute the corrected contingency coefficient. Since it very close to 0, we conclude that the voting does not depend on the sex.
>contingency(A)
0.0427225484717
Let us demonstrate how to read and write a vector of reals to a file.
>a=random(1,100); mean(a), dev(a),
0.5344472956 0.291476286915
To write the data to a file, we use the writematrix function. Since this introduction notebook is most likely in a directory, where the user has no write access, we write the data to the user home directory. For own notebooks, this is not necessary, since the data file will be written into the notebook directory.
>filename=userhome()|"test.dat"
C:\Users\Rene\test.dat
Now we write the column vector a' to the file. This yields one number in each line of the file.
>writematrix(a',filename);
To read the data, we use readmatrix.
>a=readmatrix(filename)';
And remove the file.
>remove(filename); >mean(a), dev(a),
0.5344472956 0.291476286915
Sometimes we have strings with tokens like the following.
>s1:="f m m f m m m f f f m m f"; ... s2:="f f f m m f f";
To tokenize this, we define a vector of tokens.
>tok:=["f","m"]
f m
Then we can count the number of times each token appears in the string, and put the result into a table.
>M:=getmultiplicities(tok,strtokens(s1))_ ... getmultiplicities(tok,strtokens(s2));
Write the table with the token headers.
>writetable(M,labc=tok,labr=1:2,wc=8)
f m 1 6 7 2 5 2
Next we use a variance analysis (F-test) to test three samples of normally distributed data for same mean value.
>x1=[109,111,98,119,91,118,109,99,115,109,94]; mean(x1),
106.545454545
>x2=[120,124,115,139,114,110,113,120,117]; mean(x2),
119.111111111
>x3=[120,112,115,110,105,134,105,130,121,111]; mean(x3)
116.3
>varanalysis(x1,x2,x3)
0.0136248489721
This means, we reject the hypothesis of same mean value. We can compute the binomial distribution. First there is the binomialsum function, which returns the probability of i or less hits out of n trials.
>binsum(410,1000,0.4)
0.751401349654
The normalsum function approximates the binomial distribution with the normal distribution and is much faster.
>normalsum(410,1000,0.4)
0.751041893668
The following commands are the direct way to get the above result. But for large n, the direct summation is not accurate and slow.
>p=0.4; i=0:410; n=1000; sum(bin(n,i)*p^i*(1-p)^(n-i))
0.751401349655
By the way, invbinsum computes the inverse of binomialsum.
>invbinsum(0.75,1000,0.4)
410
In Bridge, we assume 5 outstanding cards (out of 52) in two hands (26 cards). Let us compute the probability of a distribution worse than 3:2 (e.g. 0:5, 1:4, 4:1 or 5:0).
>2*hypergeomsum(1,5,13,26)
0.321739130435
An application is the median test, which rejects data samples with different mean distribution testing the median of the united sample.
>a=[56,66,68,49,61,53,45,58,54]; >b=[72,81,51,73,69,78,59,67,65,71,68,71]; >mediantest(a,b)
0.0241724220052
Another test on equality is the rank test. It is much sharper than the median test.
>ranktest(a,b)
0.00199969612469
In the following example, both distributions have the same mean.
>ranktest(random(1,100),random(1,50)*3-1)
0.139095655024
Let us now try to simulate two treatments a and b applied to different persons.
>a=[8.0,7.4,5.9,9.4,8.6,8.2,7.6,8.1,6.2,8.9]; >b=[6.8,7.1,6.8,8.3,7.9,7.2,7.4,6.8,6.8,8.1];
The signum test decides, if a is better than b.
>signtest(a,b)
0.0546875
This is too much of an error. We cannot reject that a is as good as b. The Wilcoxon test is sharper than this test, but relies on the quantitative value of the differences.
>wilcoxon(a,b)
0.0296680599403
Let us try two more tests using generated series.
>wilcoxon(normal(1,20),normal(1,20)-1)
0.000509414522477
>wilcoxon(normal(1,20),normal(1,20))
0.382599222249
Linear regression can be done with the functions polyfit or various fit functions. For a start we find the regression line for univariate data with polyfit(x,y,1).
>x=1:10; y=[2,3,1,5,6,3,7,8,9,8]; writetable(x'|y',labc=["x","y"])
x y 1 2 2 3 3 1 4 5 5 6 6 3 7 7 8 8 9 9 10 8
We want to compare non-weighted and weighted fits. First the coefficiens of the linear fit.
>p=polyfit(x,y,1)
[ 0.733333333333 0.812121212121 ]
Now the coefficients with a weight that emphasizes the last values.
>w&="exp(-(x-10)^2/10)"; pw=polyfit(x,y,1,w=w(x))
[ 4.71566356925 0.383190230089 ]
We put everything into one plot for the points and the regression lines, and for the weights used.
>figure(2,1); ... figure(1); statplot(x,y,"b",xl="Regression"); ... plot2d("evalpoly(x,p)",>add,color=blue,style="--"); ... plot2d("evalpoly(x,pw)",5,10,>add,color=red,style="--"); ... figure(2); plot2d(w,1,10,>filled,style="/",fillcolor=red,xl=w); ... figure(0):
The following is a test for the random number generator. Euler uses a very good generator, so we need not expect any problems. First we generate ten millions of random numbers in [0,1].
>n:=10000000; r:=random(1,n);
Next we count the distances between two numbers less than 0.05.
>a:=0.05; d:=differences(nonzeros(r<a));
Finally, we plot the number of times, each distance occured, and compare with the expected value.
>m=getmultiplicities(1:100,d); plot2d(m); ... plot2d("n*(1-a)^(x-1)*a^2",color=red,>add):
Clear the data.
>remvalue n;
There are some more examples in the examples directory of Euler. Examples/Black Swan Distribution Examples/US Population Forecast Examples/Sunspot Data Examples/Econometrics Examples/Benfords Law