forecasting:large_sets_of_data

One way to increase the reliability of your predictions is simply to observe more samples. The more samples you observe, the more concrete your models and estimated probabilities become.

When you are given a single sample x_i from some set of data X, drawn from some probabilistic or deterministic distribution, the only thing you can say is that the set contains this sample. From the data so far, you could try to assume that this sample represents the mean of the set and that the current "un-biased" variance is 0. Notice how shaky that logic is. If this sample is pulled off a non-discrete/analog signal, the probability of observing that exact sample is technically 0:

P(X = x_i) = ∫ from x_i to x_i of f_X(x) dx = 0.

If you want the probability of a single point of a continuous random variable, the limits of integration run from x_i to x_i, and the area of an infinitely thin line is 0.

Now as you increase the number of samples from 1 to 2, you have two samples. You can calculate your "un-biased" variance from this equation:

s^2 = (1/(N−1)) Σ_{i=1}^{N} (x_i − x̄)^2, where x̄ = (1/N) Σ_{i=1}^{N} x_i.

The empirical distribution of these two samples looks like a Bernoulli random variable ~B(p), where p is the probability of landing on one of the two values. Specifically it is a ~B(0.5), just placed on a different domain, because half of your samples are one number and the other half are another number. That does not mean the source of the set is Bernoulli: take any set of numbers from any distribution, pick 2 of them, and you will always end up with a ~B(1/2). However, the mean of those 2 samples is closer to the expected value E(X) than the mean of your first sample alone, which is simply that sample. As you increase the number of samples, the empirical distribution will start to approximate the actual distribution. The law of large numbers states that the more samples you draw from a distribution, the closer the average of the samples gets to the expected value, the ideal average of that set. As the number of samples, let us call it N, approaches infinity, the mean and variance of the sampled set become equal to the mean and variance of the distribution. Simply taking more samples before making your prediction gives your model more confidence about how the set X behaves and is shaped.
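To make that concrete, here is a minimal sketch in Python with numpy. The exponential distribution (true mean 2, true variance 4) is just an assumed example; the point is only that the sample mean and un-biased variance settle toward the true values as N grows.

    # A minimal sketch of the law of large numbers: the sample mean and
    # un-biased variance drift toward the true values as N grows.
    # The exponential(scale=2) distribution is an assumed example;
    # its true mean is 2 and its true variance is 4.
    import numpy as np

    rng = np.random.default_rng(0)
    for n in [2, 10, 100, 10_000, 1_000_000]:
        samples = rng.exponential(scale=2.0, size=n)
        mean = samples.mean()          # estimate of E(X)
        var = samples.var(ddof=1)      # un-biased variance, divides by N-1
        print(f"N={n:>9}  mean={mean:.3f}  var={var:.3f}")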

Note: I am not the creator of these problems; they are pretty famous and common problems.

Problem 1)

Suppose there is a beggar outside of a pretty fancy bar. The owner of the fancy bar told him to stop begging and leave. As the beggar pleaded, the owner allowed him to stay until a person gives him money. Lucky for him, everyone in this bar is very nice and willing to help, to an extent. What should this beggar do?

Solution 1)

Note there are many solutions/plans of attack. One solution is to take the very first offer; another is to take the first offer you are content with, say anything over some value $Y, and then leave for the next bar.

The hardest part is that you don't know the distribution of the amount of money each person is willing to give you. If you wait for $Y, maybe not a single person is willing to give you that much. Or maybe there is a person willing to give you that much, but also a person willing to give you double that amount! We are going to make a couple of assumptions. First, the place is so popular that there will always be a constant flow of people coming out of the bar. Second, he is a committed beggar with lots of time, willing to stay there for a finite or infinite amount of time. Third, the amounts of money given by each person asked are independent of one another. Since the smallest amount of money is a penny, the distribution X (the amount of money a person is willing to give) is technically discrete.

Let's say the maximum a person is willing to give is y_m. The probability that a given person offers you the maximum amount of money is P_X(X = y_m), or in terms of the CDF, 1 − F_X(y_m − 0.01). The probability that none of the first n people offers the maximum is (1 − P_X(X = y_m))^n, so the probability that at least one of them does is 1 − (1 − P_X(X = y_m))^n. More generally, the probability that the maximum offer within the first n people is at most y_l is F_X(y_l)^n. Try this with most distributions that meet the assumptions: for a large fixed n, this probability grows as y_l grows. A good plan of attack is therefore to ask the first n people for money but not take any offer, and then, after n people, take the first offer that matches or beats the maximum you observed. One flaw is that if you were very, very unlucky and your first n people were all stingy, your threshold is set too low; this becomes unlikely if n is large enough (as long as he doesn't die from old age). A second flaw is that everyone willing to give an amount of y_l or greater may already have passed, and no one else will match it, leaving the guy stuck forever; but since we assume independence and a constant flow of people, the probability of this happening is 0. Lastly, you may want to find the optimal n, trading off the time spent waiting against cashing out, which goes against the fact that he is a committed beggar.
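Here is a minimal simulation sketch of that plan of attack (Python with numpy). The lognormal offer distribution, the rounding to pennies, and the choice n = 50 are assumptions made purely for illustration: observe the first n offers without taking any, then accept the first later offer that matches or beats the observed maximum.

    # Sketch: observe the first n offers without accepting, then take the
    # first later offer >= the observed maximum. The offer distribution
    # (lognormal, rounded to pennies) is an assumption for illustration.
    import numpy as np

    def beg(rng, n=50, max_people=20_000):
        offers = np.round(rng.lognormal(mean=1.0, sigma=0.75, size=max_people), 2)
        threshold = offers[:n].max()       # best offer seen while only watching
        for offer in offers[n:]:
            if offer >= threshold:         # first offer matching the observed max
                return offer
        return None                        # never matched (very unlikely here)

    rng = np.random.default_rng(1)
    takes = [beg(rng) for _ in range(1000)]
    print("average amount taken:", np.mean([t for t in takes if t is not None]))
    print("average first offer: ", np.round(rng.lognormal(1.0, 0.75, 1000), 2).mean())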

Problem 2)

You are part of a game show where people can join and leave the game at any time. Each round, the show host picks a ball with a random number j on it; the numbers come from (and may repeat from) a certain distribution J. Prior to the pick, each player gives an amount of money X to the host. If you gave the host less than the number on the ball, you are out of the game. You want to stay on the show for as long as you can, for whatever reason. You decide to just watch for n rounds and keep track of the numbers drawn (let's call these values X1 to Xn). How much should you give to the show host every round?

Solution 2)

Of course you could spend exactly infinity dollars every round and never get kicked off, but let's exclude that possibility. You could instead pick one large number C and pay it every single round. The probability that you eventually get kicked off, summing over the round n on which it happens, is

Σ_{n=0}^{∞} P_J(J ≤ C)^n · P_J(J > C) = Σ_{n=0}^{∞} F_J(C)^n · (1 − F_J(C)).

If J doesn't have a finite maximum value, then F_J(C) < 1 and this is the sum of a geometric probability mass function (you should review geometric random variables, e.g. flipping coins until a head appears), which sums to 1. This means the probability that you will lose eventually is 1. This solution isn't necessarily bad, but let's look at more.
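A quick sketch of what a constant bet buys you: with per-round survival probability F_J(C), the number of rounds survived is geometric, so you last about F_J(C) / (1 − F_J(C)) rounds on average before losing. The exponential J and the bet C = 10 below are assumptions for illustration.

    # Sketch: bet a constant C every round against an assumed exponential J.
    # The per-round survival probability is F_J(C), so the number of rounds
    # survived is geometric and you lose eventually with probability 1.
    import numpy as np

    rng = np.random.default_rng(2)
    C = 10.0                                  # assumed constant bet
    scale = 3.0                               # assumed: J ~ Exponential(mean 3)
    p_survive = 1.0 - np.exp(-C / scale)      # F_J(C) for the exponential

    horizon = 10_000                          # long enough that a loss always occurs
    draws = rng.exponential(scale=scale, size=(200, horizon))
    rounds_survived = (draws > C).argmax(axis=1)   # index of the first losing round
    print("theoretical mean rounds before loss:", p_survive / (1 - p_survive))
    print("simulated mean rounds before loss:  ", rounds_survived.mean())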

A second solution is to keep increasing your bet: pay G(n) on round n, where G is an increasing function. Since J may be unbounded, it is impossible to guarantee a 0 percent probability of getting kicked off, but we can ask that the probability of eventually getting kicked off be at most ε. That means you want to design G(n) such that

ε ≥ Σ_{n=0}^{∞} P_J(J ≤ G(n))^n · P_J(J > G(n)) = Σ_{n=0}^{∞} F_J(G(n))^n · (1 − F_J(G(n))).

(Because G(n) increases, F_J(G(n))^n over-estimates the chance of surviving the first n rounds, so this sum over-estimates the true losing probability and satisfying the inequality is enough.)
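One mechanical way to design such a G(n), sketched below, uses a slightly looser but simpler sufficient condition than the sum above: a plain union bound. If P_J(J > G(n)) ≤ ε / 2^(n+1) for every n, the chance of ever losing is at most ε. The exponential J (mean 3) is an assumption so that the inverse tail has a closed form; for another distribution you would invert its survival function instead.

    # Sketch: design an increasing bet schedule G(n) for a target epsilon
    # using a union bound: if P(J > G(n)) <= eps / 2**(n+1) for every n,
    # then the chance of ever losing is at most eps.
    # J ~ Exponential(mean 3) is assumed purely for illustration, so the
    # inverse survival function has a closed form: G = -scale * ln(tail).
    import numpy as np

    def bet_schedule(eps, scale=3.0, rounds=20):
        n = np.arange(rounds)
        tail = eps / 2.0 ** (n + 1)      # per-round budget for P(J > G(n))
        return -scale * np.log(tail)     # smallest C with P(J > C) <= tail

    G = bet_schedule(eps=0.01)
    for n, g in enumerate(G[:8]):
        print(f"round {n}: bet at least {g:.2f}")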

Finally you realize that you are going to lose eventually, but you want to control the rate at which you leave, and G(n) keeps increasing, so you would lose money far too fast. So, using what we learned from the first problem, we can simply observe for a long time: as n gets very large, we can be fairly sure of picking the right constant C for the first scenario. The chance that you picked the wrong C just by looking at the previous maxima is the probability that none of the n observed numbers exceeded C, multiplied by the probability that you eventually fail, which is less than the following (since your C is even larger than the observed maximum of the X_i):

F_J(C)^n · Σ_{k=0}^{∞} F_J(C)^k · (1 − F_J(C)) = F_J(C)^n.

Note that this bound is very small when F_J(C) is not close to 1, i.e. when a C that small would have been unlikely to sit above all n observations; when C does cover most of the distribution the bound stays high overall, but then the probability of losing any single round becomes very small. It is also good to note that the larger C is compared to the maximum of the X_i, the smaller this probability gets, but the higher your cost. An 'OK' assumption to make is that the probability of drawing ever larger numbers does not increase as the numbers grow, which is not necessarily true but is very common in most probability distribution functions.
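Below is a small sketch of that watch-then-bet idea. The exponential J, the 10% safety margin above the observed maximum, and the 100,000-round horizon are all assumptions; the pattern to notice is that watching longer before committing to C buys you more rounds of survival.

    # Sketch: watch n rounds, then bet a constant C just above the observed
    # maximum. J ~ Exponential(mean 3) and the 10% margin are assumptions.
    import numpy as np

    def survive(rng, n_watch, scale=3.0, horizon=100_000):
        observed = rng.exponential(scale=scale, size=n_watch)
        C = 1.1 * observed.max()                 # bet slightly above the seen max
        draws = rng.exponential(scale=scale, size=horizon)
        losses = np.nonzero(draws > C)[0]
        return losses[0] if losses.size else horizon   # rounds survived after watching

    rng = np.random.default_rng(3)
    for n_watch in [10, 100, 1000, 10_000]:
        rounds = [survive(rng, n_watch) for _ in range(200)]
        print(f"watched {n_watch:>6} rounds -> median rounds survived: {int(np.median(rounds))}")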

A final method, if you don't want to sample for long, is to take the mean of the data you have so far and bet a number greater than that mean by r standard deviations, where you pick r to trade off probability against cost and assume J is Gaussian-like, even though it probably isn't. THE MAIN POINT OF THIS, THOUGH, IS THE METHOD BEFORE THIS ONE: OBSERVING PAST VALUES IS A GOOD WAY TO REASSURE YOURSELF!
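For completeness, a tiny sketch of that mean-plus-r-deviations rule. The choice r = 3, the 30 observed rounds, and the exponential J are assumptions; note the rule quietly pretends J is Gaussian, which it is not here.

    # Sketch of the mean-plus-r-deviations rule on a small batch of observed
    # rounds. r = 3 and the exponential J are assumptions.
    import numpy as np

    rng = np.random.default_rng(4)
    observed = rng.exponential(scale=3.0, size=30)
    r = 3
    C = observed.mean() + r * observed.std(ddof=1)
    print(f"bet C = {C:.2f}, per-round losing probability "
          f"= {np.exp(-C / 3.0):.4f}")   # exact tail for the assumed exponential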

Major Complications

One thing you would want to know, however, is whether the process is time-invariant. If you take a sample from a signal or dataset right now, who is to say the distribution will be the same a couple of seconds, hours, days, or years from now? Most sensors are in fact time-variant. You can test this by computing the autocorrelation between all of your points; this won't be accurate unless you take plenty of data at specific times, whether in seconds, hours, days, or years. If the correlation matrix approximates an identity matrix, you can say it is a stationary/time-invariant signal. If not, you can do a couple of things. One option, if the data only changes a lot over long strides of time (how long varies with what you are sampling: solar radiation has seasonal effects, so a couple of months; tracking a fast-moving object, a couple of seconds), is to build your model only from the data within one such stride. Now you are trading reliable data against the number of data points, quality versus quantity. You can give more weight to the quality data and less weight to the rest, or you can sample faster so you still have a large set of data within that small interval of time.
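Here is a crude sketch of that check along the lines described above: slice the record into "days", correlate the samples taken at the same times of day against each other, and measure how far the correlation matrix is from the identity. The synthetic signals, the 24-sample "day", and the choice of a drifting mean as the non-stationary example are all assumptions.

    # Sketch of a crude time-invariance check: slice the record into days,
    # correlate the samples taken at the same times of day, and see how far
    # the correlation matrix is from the identity.
    import numpy as np

    def distance_from_identity(signal, samples_per_day=24):
        segments = signal.reshape(-1, samples_per_day)   # rows: days, cols: time of day
        corr = np.corrcoef(segments, rowvar=False)        # correlation between times of day
        return np.abs(corr - np.eye(samples_per_day)).max()

    rng = np.random.default_rng(5)
    t = np.arange(200 * 24)
    stationary = 0.3 * rng.standard_normal(t.size)              # plain noise
    drifting = 0.01 * t + 0.3 * rng.standard_normal(t.size)     # mean drifts over time

    print("stationary signal:", distance_from_identity(stationary))   # small, sampling noise only
    print("drifting signal:  ", distance_from_identity(drifting))     # near 1, not time-invariant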

Another approach is to see from your data whether it is periodic. Note there is no way to be certain the signal is periodic unless you are in a stable system and have taken an infinite number of samples; however, if you take a lot of samples, you can say it is periodic with a certain confidence. Let us say it repeats from n_a to n_b, which means it has a period of n_b − n_a, which we shall call T. You could then do a kind of normalization. If your data is periodic, you can try to find the mean and standard deviation of the data at times t + kT, where t is any point between n_a and n_b and k is any integer. Once you have the mean, you can literally map each specific time to the expected output at that specific time. After finding the mean and standard deviation at each t + kT, you can split the data into multiple sets of data and multiple distributions at specific times. More information on normalization can be found in the main resources.
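A minimal sketch of that per-phase normalization, assuming the period T is already known: group samples by t mod T, estimate a mean and standard deviation for each phase, then use them to map times to expected values or to normalize the data. The synthetic daily-cycle signal and T = 24 are assumptions.

    # Sketch of per-phase normalization for a signal assumed periodic with
    # period T: group samples by t mod T, estimate a mean and standard
    # deviation for each phase, and use them to map times to expected values
    # (or to z-score observations). The synthetic signal is an assumption.
    import numpy as np

    rng = np.random.default_rng(6)
    T = 24
    t = np.arange(T * 365)
    data = 5 + 3 * np.sin(2 * np.pi * t / T) + rng.standard_normal(t.size)

    phase = t % T
    phase_mean = np.array([data[phase == p].mean() for p in range(T)])
    phase_std = np.array([data[phase == p].std(ddof=1) for p in range(T)])

    # Expected output at any future time, and a normalized version of the data.
    future_t = 10_000
    print("expected value at t =", future_t, "is about", phase_mean[future_t % T])
    normalized = (data - phase_mean[phase]) / phase_std[phase]
    print("normalized mean/std:", normalized.mean().round(3), normalized.std().round(3))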

Authors

Contributing authors:

jeremygg

Created by jeremygg on 2015/12/08 02:34.
