forecasting:large_sets_of_data

One way to increase the reliability of your predictions is simply to observe more samples. The more samples you observe, the more concrete your models and estimated probabilities become.
  
When you are given a single sample x<sub>i</sub> from some set of data X, which comes from some probabilistic or deterministic distribution, the only thing you can say is that the set contains this sample. From the data so far, you could try to assume that this sample represents the mean of the set and that the current "un-biased" variance is 0. Notice how flawed that logic is. If this sample is pulled off a non-discrete/analog signal, the probability of that exact sample occurring is technically 0:

P(X = x<sub>i</sub>) = ∫<sub>x<sub>i</sub></sub><sup>x<sub>i</sub></sup> f<sub>X</sub>(x) dx = 0.

If you want the probability of a single point of a continuous random variable, the limits of integration run from x<sub>i</sub> to x<sub>i</sub>, which gives 0: the area of an infinitely thin line is 0.
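As a quick numerical illustration (a minimal sketch in Python with scipy, my choice of tooling rather than the author's): for a continuous distribution the probability of a single point is 0, and the meaningful quantity is the probability of an interval.

<code python>
from scipy.stats import norm

# Assumed example distribution: a standard normal signal.
x_i = 1.3

# "Probability" of hitting exactly x_i: the integral from x_i to x_i is zero.
p_point = norm.cdf(x_i) - norm.cdf(x_i)
print(p_point)  # 0.0

# What you can compute is the probability of a small interval around x_i.
eps = 0.01
p_interval = norm.cdf(x_i + eps) - norm.cdf(x_i - eps)
print(p_interval)  # small but nonzero
</code>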
  
Now, as you increase the number of samples from 1 to 2, you have two samples. You can calculate your "un-biased" variance from this equation, where x̄ is the sample mean:

s<sup>2</sup> = (1/(N−1)) Σ<sub>i=1</sub><sup>N</sup> (x<sub>i</sub> − x̄)<sup>2</sup>.

The empirical distribution of two samples looks like a Bernoulli random variable ~B(p), where p is the probability of one of the two values. Specifically it is a ~B(0.5), just over a different domain, because half of your samples are one number and the other half are another. That does not mean your source set is Bernoulli: take any set of numbers from any distribution, pick 2 of them, and you will always get a ~B(1/2). However, the mean of those 2 numbers is closer to the expected value E(X) than the mean of your first sample alone, which is simply that sample. As you increase your samples, the empirical distribution will start to approximate your actual distribution. The law of large numbers states that as you take more samples from a distribution, the average of the samples converges to the expected value, the ideal average of that distribution. As the number of samples, call it N, approaches infinity, the mean and variance of the sampled set equal the mean and variance of the distribution. Simply taking more samples before making your prediction gives you more confidence in how the set X behaves and is shaped.
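Here is a minimal simulation sketch of that convergence (Python/numpy and the Normal(5, 2) source are assumptions for illustration): as N grows, the sample mean and the un-biased sample variance approach the distribution's true mean and variance.

<code python>
import numpy as np

rng = np.random.default_rng(0)
true_mean, true_std = 5.0, 2.0  # assumed source distribution: Normal(5, 2)

for n in [2, 10, 100, 10_000, 1_000_000]:
    samples = rng.normal(true_mean, true_std, size=n)
    # ddof=1 gives the "un-biased" variance (divide by N - 1).
    print(n, samples.mean(), samples.var(ddof=1))
# As n grows, the printed mean approaches 5.0 and the variance 4.0 (= 2.0**2).
</code>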
  
===Solution 1)===
Note there are many possible solutions/plans of attack. One solution is to take the very first offer, or perhaps the first offer you are content with over some value $Y, and then leave for the next bar.
  
The hardest part is that you don't know the distribution of the amount of money each person is willing to give you. If you hold out for $Y, maybe not a single person is willing to give you that much. Or maybe there is a person willing to give you that much, but also a person willing to give you double that amount! We are going to make a couple of assumptions. First, the place is so popular that there will always be a constant flow of people coming out of the bar. Second, he is a committed beggar with lots of time, willing to stand there for a finite or infinite amount of time. Third, the amounts of money given by each person asked are independent of one another. Since the smallest amount of money is a penny, the distribution X (the amount of money a person is willing to give) is technically discrete.
  
Let's say the maximum a person is willing to give is y<sub>m</sub>. The probability of someone offering you the max amount of money is P<sub>X</sub>(X = y<sub>m</sub>), or in terms of the CDF, 1 − F<sub>X</sub>(y<sub>m</sub> − 0.01) (one penny below the max). The probability of seeing the maximum at least once among the first n people is 1 − (1 − P<sub>X</sub>(X = y<sub>m</sub>))<sup>n</sup>. The probability that the maximum offer within the first n people is at most y<sub>l</sub> is F<sub>X</sub>(y<sub>l</sub>)<sup>n</sup>. Try this with most distributions that meet the assumptions: when n is large, this probability is only appreciable when y<sub>l</sub> is large, i.e. the best offer seen so far tends to grow with n. A good plan of attack is therefore to ask the first n people for money but not take any offer; then, after n people, take the first offer that matches or exceeds that observed maximum (a sketch of this follows below). One flaw: if you were very, very unlucky, you got nothing but stingy people in your first n; this becomes unlikely if n is large enough (as long as he doesn't die from old age). Another: everyone willing to give y<sub>l</sub> or more may already have passed, with no one left to match it, leaving the guy stuck forever; but since we assume independence and a constant flow of people, the probability of this happening is 0. Lastly, you may want to find the optimal n, trading the time spent waiting against cashing out, which goes against the premise that he is a committed beggar.
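A rough simulation of that plan of attack, as a sketch (Python/numpy; the exponential offer distribution, its scale, and the trial counts are all assumed for illustration): watch the first n offers, record the maximum, then accept the first later offer that meets or exceeds it.

<code python>
import numpy as np

rng = np.random.default_rng(1)

def beg(n_observe, max_people=100_000):
    """Observe n_observe offers without taking any, then accept the
    first offer that meets or exceeds the observed maximum."""
    observed_max = 0.0
    for i in range(max_people):
        # Assumed offer distribution: exponential, rounded to the penny.
        offer = round(rng.exponential(scale=2.0), 2)
        if i < n_observe:
            observed_max = max(observed_max, offer)
        elif offer >= observed_max:
            return offer
    return observed_max  # pathological case: nobody ever matched the max

# Average accepted offer over many trials, for different observation windows.
for n in [0, 10, 100, 1000]:
    takes = [beg(n) for _ in range(1000)]
    print(n, round(float(np.mean(takes)), 2))
# A larger n yields a better accepted offer, at the cost of more waiting.
</code>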
  
===Problem 2)===
You are part of a game show where people can join and leave the game at any time. The show host picks a ball with a random number j on it; the numbers come from (and may repeat under) a certain distribution J. Prior to the pick, each player gives an amount of money X to the host. If you gave the host less than the number on the ball, you are out of the game. You want to stay on the show for as long as you can, for whatever reason. You decide to just watch for n rounds and keep track of the drawn numbers (call these values X<sub>1</sub> to X<sub>n</sub>). How much should you give the show host every round?

===Solution 2)===
Of course you could give exactly infinity dollars every round and never get kicked off, but let's exclude that possibility. You could instead pick one large number C and give it every single round. The probability that you eventually get kicked off the show, summed over the round n on which it happens, is

Σ<sub>n=0</sub><sup>∞</sup> P<sub>J</sub>(j ≤ C)<sup>n</sup> P<sub>J</sub>(j > C) = Σ<sub>n=0</sub><sup>∞</sup> F<sub>J</sub>(C)<sup>n</sup> (1 − F<sub>J</sub>(C)).

If J doesn't have a finite maximum value, this is exactly the sum over a geometric probability function (you should review geometric random variables, e.g. flipping a coin until the first head), and it sums to 1. This means the probability that you will lose eventually is 1. This solution isn't necessarily bad, but let's look at some more.
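A small numeric check of that geometric-series claim, as a sketch (Python; the exponential J with mean 10 and the cutoff C = 50 are assumptions): survival for n rounds has probability F<sub>J</sub>(C)<sup>n</sup>, which decays to 0, and the losing probabilities sum to 1.

<code python>
import math

# Assumed distribution for J: exponential with mean 10, so F_J(c) = 1 - exp(-c/10).
def F(c):
    return 1.0 - math.exp(-c / 10.0)

C = 50.0
r = F(C)  # probability that a single round's j is <= C

for n in [1, 10, 100, 1000]:
    print(n, r ** n)  # P(still in the game after n rounds) -> 0

# Total probability of losing eventually: the geometric series
# sum_{n>=0} F(C)^n * (1 - F(C)), which converges to 1 whenever F(C) < 1.
partial = sum(r ** n * (1 - r) for n in range(10_000))
print(partial)  # approximately 1.0
</code>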
A second solution is to keep increasing the amount you give over time; call this function G(n). Considering that J may be unbounded, it is impossible to make the probability of eventually being kicked off exactly 0. But we can insist that it be at most some ε. That means you want to design G(n) such that

Σ<sub>n=0</sub><sup>∞</sup> P<sub>J</sub>(j ≤ G(n))<sup>n</sup> P<sub>J</sub>(j > G(n)) = Σ<sub>n=0</sub><sup>∞</sup> F<sub>J</sub>(G(n))<sup>n</sup> (1 − F<sub>J</sub>(G(n))) ≤ ε.

(Since G(n) is increasing, the power F<sub>J</sub>(G(n))<sup>n</sup> upper-bounds the true probability of surviving the earlier, cheaper rounds, so keeping this sum below ε is a conservative, safe requirement.)
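One concrete way to build such a G(n), sketched in Python (the exponential J with mean 10 and the ε-halving budget are my assumptions, not the author's): give round n a loss budget of ε/2<sup>n+1</sup>, so the budgets sum to ε, and invert the tail of the CDF to find the threshold.

<code python>
import math

EPS = 0.01  # desired bound on the probability of ever being kicked off

# Assumed J: exponential with mean 10, so 1 - F_J(g) = exp(-g / 10).
def G(n):
    # Budget EPS / 2**(n+1) of losing probability to round n; the budgets
    # sum to EPS. Solve exp(-g / 10) = EPS / 2**(n+1) for g.
    return -10.0 * math.log(EPS / 2 ** (n + 1))

for n in [0, 1, 10, 50]:
    print(n, round(G(n), 2))
# G(n) grows by about 10 * ln(2) = 6.93 per round here, and by the union
# bound the chance of ever being kicked off is at most sum_n EPS/2**(n+1) = EPS.
</code>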
Finally you realize that you are going to lose eventually, but you want control over the rate at which you leave. You also notice that G(n) keeps increasing, so you would lose far too much money far too fast. Using what we learned from the first problem, we can instead just observe for a long time: as n gets very large, we can be fairly confident of picking the right C for the first scenario. The chance that you picked a bad C simply by looking at the previous maxima is the probability that none of the observed numbers exceeded the maximum of the X<sub>i</sub>, multiplied by the probability that you eventually fail; since C is chosen even larger than that max, this is less than

F<sub>J</sub>(C)<sup>n</sup> Σ<sub>k=0</sub><sup>∞</sup> P<sub>J</sub>(j ≤ C)<sup>k</sup> P<sub>J</sub>(j > C) = F<sub>J</sub>(C)<sup>n</sup>,

because the geometric sum equals 1. The larger C is compared to the maximum of the X<sub>i</sub>, the smaller the probability of losing any single round, but the higher your cost per round. An "OK" assumption to make is that the probability of drawing ever larger numbers does not keep increasing as the numbers grow; this is not necessarily true, but it holds for most common probability distribution functions.
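A sketch of this watch-then-commit estimate (Python/numpy; the exponential J, the 20% margin above the observed max, and the 100-round horizon are assumptions for illustration): observe n rounds, pick C above the observed maximum, and measure how often C survives a block of further rounds.

<code python>
import numpy as np

rng = np.random.default_rng(2)

def survival_rate(n_observe, n_play, trials=2000):
    """Fraction of trials where C, chosen just above the observed max,
    survives all n_play further rounds."""
    survived = 0
    for _ in range(trials):
        observed = rng.exponential(scale=10.0, size=n_observe)  # assumed J
        C = 1.2 * observed.max()  # margin above the observed maximum
        future = rng.exponential(scale=10.0, size=n_play)
        if future.max() <= C:
            survived += 1
    return survived / trials

for n in [10, 100, 1000]:
    print(n, survival_rate(n_observe=n, n_play=100))
# Observing longer pushes the observed max (and hence C) up, so the chance
# of surviving a fixed block of further rounds increases with n.
</code>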
A final method, if you don't want to sample for long, is to take the mean of the data you have so far and give a number greater than that mean by some number r of standard deviations, where you pick r to trade probability against cost, assuming J is Gaussian-like even though it probably isn't. THE MAIN POINT, THOUGH, IS THE METHOD BEFORE THIS ONE: OBSERVING PAST VALUES IS A GOOD WAY TO REASSURE YOURSELF!
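That rule of thumb as a short sketch (Python/numpy; the made-up past rounds and r = 3 are placeholders): give the sample mean plus r un-biased standard deviations.

<code python>
import numpy as np

observed = np.array([4.2, 11.0, 7.5, 9.9, 3.1, 8.8])  # made-up past rounds
r = 3  # trade-off knob: larger r is safer but costlier

# Give the sample mean plus r "un-biased" standard deviations.
bid = observed.mean() + r * observed.std(ddof=1)
print(round(float(bid), 2))
# If J really were Gaussian, a mean + 3-sigma bid would survive roughly
# 99.9% of single rounds; for other distributions this is only a heuristic.
</code>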
  
===Major Complications===