forecasting:large_sets_of_data

One way to increase the reliability of your predictions is simply to observe more samples. With more samples, your models and estimated probabilities rest on firmer ground.

When you are given a single sample $x_i$ from some set of data $X$, drawn from some probabilistic or deterministic distribution, the only thing you can say is that this set contains this sample. From the data so far, you could assume that this sample represents the mean of the set and that the current "un-biased" variance is 0. Notice how shaky that logic is. If this sample is pulled off a non-discrete/analog signal, the probability of that exact sample occurring is technically 0:

$P(X = x_i) = \int_{x_i}^{x_i} f_X(x)\,dx = 0$

If you want to find the probability of a single point of a continuous random variable, the limits of integration run from $x_i$ to $x_i$, which gives 0: the area of an infinitely thin line is 0.
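As a quick sanity check of this (a minimal sketch; the standard normal density is an arbitrary stand-in for $f_X$, not anything given in the text):

<code python>
# The probability of one exact point of a continuous distribution is the
# integral of its density over a zero-width interval, which is 0.
from scipy.integrate import quad
from scipy.stats import norm

x_i = 1.3                        # an arbitrary sample point
p, _ = quad(norm.pdf, x_i, x_i)  # integrate f_X from x_i to x_i
print(p)                         # 0.0 -- a zero-width interval has no area
</code>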

Now as you increase the number of samples from 1 to 2, you have two samples. You can calculate your "un-biased" variance from this equation:

$s^2 = \frac{1}{N-1} \sum_{k=1}^{N} \left( x_k - \bar{x} \right)^2$

With two samples, the empirical distribution is a Bernoulli random variable $\sim B(p)$, where $p$ is the probability of one of the two values. Specifically it is $\sim B(0.5)$, just over a different domain, because half of your samples is one number and the other half is another number. You cannot say your source set is Bernoulli: take any set of numbers from any distribution, pick 2 of them, and you will always get a $\sim B(1/2)$. However, the mean of those 2 numbers is likely closer to the expected value $E[X]$ than the mean of your first sample alone, which is simply that sample. As you increase your samples, the empirical distribution will start to approximate your actual distribution. The law of large numbers states that the more samples you draw from a distribution, the closer the average of the samples gets to the expected value, the ideal average, of that set. As the number of samples, let us call it $N$, approaches infinity, the mean and variance of the set sampled from the distribution converge to the mean and variance of the distribution. Simply taking more samples before making your prediction gives your model more confidence in how the set $X$ behaves and is shaped.
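A minimal sketch of the law of large numbers at work; the exponential source distribution with mean 2 is an arbitrary assumption for demonstration:

<code python>
import numpy as np

rng = np.random.default_rng(0)
# Assumed source distribution for demonstration only: exponential, E[X] = 2.
samples = rng.exponential(scale=2.0, size=100_000)

# Running sample mean after the first N samples, for increasing N.
running_mean = np.cumsum(samples) / np.arange(1, samples.size + 1)
for n in (1, 2, 10, 100, 10_000, 100_000):
    print(f"N = {n:>6}   sample mean = {running_mean[n - 1]:.4f}")

# The un-biased variance s^2 = sum((x_k - xbar)^2) / (N - 1) also settles
# near the true Var(X) = 4 as N grows.
print("un-biased variance:", samples.var(ddof=1))
</code>

With one sample the "mean" is just that sample; by $N = 100{,}000$ the running mean sits close to $E[X] = 2$, exactly as the law of large numbers promises.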

Brain Teasers: Note that I am not the creator of these problems; they are pretty famous and common.

Problem 1) Suppose there is a beggar outside of a pretty fancy bar. The owner of the bar told him to stop begging and leave. As the beggar pleaded, the owner allowed him to stay until a person gives him money. Lucky for him, everyone in this bar is very nice and willing to help to an extent. What should this beggar do?

Solution 1) Note there are many solutions. One is to take the very first offer; another is to take the first offer you are content with, anything over some value $Y, and then leave for the next bar.

The hardest part is that you don't know the distribution of the amount of money each person is willing to give you. If you wait for $Y, maybe not a single person is willing to give you that much. Or maybe there is a person willing to give you that much, but also a person willing to give you double that amount! We are going to make a couple of assumptions. First, the place is so popular that there will always be a constant flow of people coming out of the bar. Second, he is a committed beggar with lots of time, willing to stay there for a finite or infinite amount of time. Third, the amounts given by each person asked are independent of one another. Since the smallest amount of money is a penny, the distribution of $X$ (the amount of money a person is willing to give) is technically discrete.

Let's say the maximum a person is willing to give is $y_m$. The probability of someone giving you that maximum amount of money is $P_X(X = y_m)$, or in terms of the CDF, $1 - F_X(y_m - 0.01)$ (one penny below the maximum, since a penny is the smallest increment).
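Here is a minimal simulation sketch of the threshold strategy. The offer distribution (exponential with mean $2, rounded down to the penny) and the candidate thresholds are assumptions chosen purely for illustration:

<code python>
import numpy as np

rng = np.random.default_rng(1)

def beg_until_threshold(threshold, mean_offer=2.0, max_asks=10_000):
    """Ask patrons one at a time; accept the first offer >= threshold.

    Offers are assumed i.i.d. exponential, floored to the penny (the
    distribution is discrete since the smallest unit is $0.01).
    Returns (amount accepted, number of people asked).
    """
    for asks in range(1, max_asks + 1):
        offer = np.floor(rng.exponential(mean_offer) * 100) / 100
        if offer >= threshold:
            return offer, asks
    return 0.0, max_asks  # gave up waiting

for y in (0.01, 1.0, 5.0, 10.0):
    trials = [beg_until_threshold(y) for _ in range(1_000)]
    avg_take = np.mean([t[0] for t in trials])
    avg_wait = np.mean([t[1] for t in trials])
    print(f"Y = ${y:5.2f}   avg take ${avg_take:6.2f}   avg people asked {avg_wait:8.1f}")
</code>

Raising $Y increases the expected take but also the expected number of people you have to ask, which is exactly the tradeoff the problem turns on.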

Warning: One thing you would want to know, however, is whether the signal is time-invariant. If you take a sample from a signal or data set right now, who is to say the distribution will be the same a couple of seconds, hours, days, or years from now? Most sensor signals are in fact time-varying. You can test this by taking the autocorrelation between all of your points. This won't be accurate unless you take a lot of data at specific times, whether in seconds, hours, days, or years. If the result approximates an identity matrix, you can say it is a stationary/time-invariant signal. If not, you can do a couple of things. One applies when the data only changes substantially after long strides of time; how long varies with what you are sampling (solar radiation has seasonal effects, so a couple of months; tracking a fast-moving object, a couple of seconds). Within these small strides of time, you can build your model from the data inside that window. Now you are trading off reliable data against the amount of data, quality vs. quantity. You can put more weight on the high-quality data and less weight on the rest, or you can sample faster so that you still have a large set of data within that small interval of time.
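A sketch of the test described above, assuming you have several independent runs of the same sensor sampled at the same time points. The data here is synthetic white noise, so the correlation matrix should come out close to identity:

<code python>
import numpy as np

rng = np.random.default_rng(2)

# Assume R independent runs of the same sensor, each sampled at the same
# T time points (rows = runs, columns = time points). White noise is a
# placeholder for real logged data.
R, T = 500, 8
runs = rng.normal(size=(R, T))

# Correlation between every pair of time points, taken across the runs.
corr = np.corrcoef(runs, rowvar=False)  # T x T matrix

# Distance from the identity matrix; a small value suggests samples taken
# at different times are uncorrelated, per the test described above.
print(np.max(np.abs(corr - np.eye(T))))
</code>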

Another way is to see from your data whether it is periodic. Note there is no telling that the signal is periodic unless you are in a stable system and have taken an infinite number of samples; however, if you simply take a lot of samples, you can say it is periodic with a certain confidence. Let us say it is periodic from $n_a$ to $n_b$, which means it has a period of $n_b - n_a$, which we shall call $T$. You can then apply a form of normalization. If your data is periodic, you can find the mean and standard deviation of the data at times $t + kT$, where $t$ is any point between $n_a$ and $n_b$ and $k$ is any integer. Once you have the mean, you can literally map specific times to the expected output at those times. After finding the mean and standard deviation at $t + kT$, you can split the data into multiple sets of data, with multiple distributions, at specific times. More information on normalization can be found in the main resources.
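A minimal sketch of this fold-by-period normalization; the period $T = 24$ (say, hourly data with a daily cycle) and the sinusoid-plus-noise signal are assumptions for illustration only:

<code python>
import numpy as np

rng = np.random.default_rng(3)

# Assumed periodic signal: period T = 24 samples, plus noise.
T = 24
n = np.arange(T * 200)  # 200 full periods
x = np.sin(2 * np.pi * n / T) + rng.normal(scale=0.3, size=n.size)

# Group every sample taken at phase t, i.e. at times t + k*T.
periods = x.reshape(-1, T)                # rows = periods, columns = phase t
phase_mean = periods.mean(axis=0)         # expected output at each t
phase_std = periods.std(axis=0, ddof=1)   # spread at each t

# Map any future time index straight to its expected output.
t_future = 1000
print(phase_mean[t_future % T], "+/-", phase_std[t_future % T])
</code>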
