Paper 1 Title:#

Deterministic and Probabilistic Wind Power Forecasts by Considering Various Atmospheric Models and Feature Engineering Approaches

Executive summary: The authors use three kinds of numerical weather prediction (NWP) wind speeds, rather than the single wind speed from the anemometer, as the inputs for feature engineering. The kernel of this paper is to construct features from these specific data sources. What is useful for me is that we can consider similar feature engineering in our work and cite this paper to alleviate the tedious description. For reference, I think the following references are of value:

Technical description:

PI coverage probability (PICP): the percentage of targets that are covered by the upper and lower bounds of the prediction interval.

\[ \mathrm{PICP}=\frac{1}{N} \sum_{t=1}^N c_t \]

where \(N\) is the number of samples and \(c_t\) is a Boolean value that is evaluated as follows:

\[ c_t= \begin{cases}1, & y_t \in\left[L_t, U_t\right] \\ 0, & y_t \notin\left[L_t, U_t\right]\end{cases} \]

where \(y_t\) is the forecast target and \(U_t\) and \(L_t\) are upper and lower bounds of the interval, respectively.

PI normalized average width (PINAW): penalizes excessive growth of the interval width.

\[ \text { PINAW }=\frac{1}{N R} \sum_{t=1}^N\left(U_t-L_t\right) \]

where \(R\) is the range of the underlying targets used for normalizing PIs.
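As a concrete illustration, a minimal sketch of both metrics; the function names `picp`/`pinaw` and the toy intervals are my own, not from the paper.

```python
# Hedged sketch of the two PI quality metrics defined above.

def picp(y, lower, upper):
    """Fraction of targets y_t falling inside [L_t, U_t]."""
    covered = sum(1 for yt, lt, ut in zip(y, lower, upper) if lt <= yt <= ut)
    return covered / len(y)

def pinaw(lower, upper, target_range):
    """Average interval width, normalized by the range R of the targets."""
    widths = [ut - lt for lt, ut in zip(lower, upper)]
    return sum(widths) / (len(widths) * target_range)

y = [1.0, 2.0, 3.0, 4.0]
L = [0.5, 1.5, 3.5, 3.0]   # the third interval misses its target
U = [1.5, 2.5, 4.0, 5.0]
print(picp(y, L, U))                   # 0.75 (3 of 4 targets covered)
print(pinaw(L, U, target_range=3.0))   # 0.375 (mean width 1.125 over R=3)
```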

The LUBE method can be regarded as a constrained nonlinear optimization problem with conflicting objectives, as follows:

Objectives:

\[ \text{Maximize: } \operatorname{PICP}(w), \quad \text{Minimize: } \operatorname{PINAW}(w) \]

Constraints:

\[ 0 \leq \operatorname{PICP}(w) \leq 100 \%, \quad \operatorname{PINAW}(w) > 0 \]

This is resolved by the following method:

\[ \begin{aligned} F(X) &=\min _{X \in \Omega}\left\{\max _{i=1, \ldots, n}\left|\mu_{\mathrm{ref}, i}-\mu_{f, i}(X)\right|\right\} \\ &=\min _{X \in \Omega}\left\{\max \left(\left|\mu_{\mathrm{ref}, \mathrm{PICP}}-\mu_{\mathrm{PICP}}(X)\right|,\left|\mu_{\mathrm{ref}, \mathrm{PINAW}}-\mu_{\mathrm{PINAW}}(X)\right|\right)\right\} \end{aligned} \]

where \(n\) is the number of objectives (here \(n=2\)), \(\mu_{f, i}\) is the membership function value of the \(i\)-th objective, \(\Omega\) is the problem search space, \(X\) is the control vector containing the NN weighting factors, and \(\mu_{\mathrm{ref}, i}\) is the reference membership value for the \(i\)-th objective.
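A tiny numerical sketch of this min-max selection; the membership values and candidate labels are illustrative assumptions, not from the paper. Each candidate control vector is scored by its worst absolute deviation from the reference memberships, and the candidate with the smallest score wins.

```python
# Hedged sketch of the min-max membership scalarization.

def minmax_score(mu_f, mu_ref):
    """Worst absolute deviation from the reference memberships."""
    return max(abs(r - f) for r, f in zip(mu_ref, mu_f))

mu_ref = [1.0, 1.0]            # ideal memberships for PICP and PINAW
candidates = {
    "X1": [0.9, 0.6],          # good coverage, but wide intervals
    "X2": [0.8, 0.8],          # balanced trade-off
}
scores = {k: minmax_score(v, mu_ref) for k, v in candidates.items()}
best = min(scores, key=scores.get)
print(scores, best)            # X2 wins: worst deviation 0.2 < 0.4
```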

Paper 2 Title:#

Privacy-preserving Spatiotemporal Scenario Generation of Renewable Energies: A Federated Deep Generative Learning Approach

Executive summary: The authors use federated learning with a central server, combined with least squares generative adversarial networks (LSGANs), to generate renewable (wind power) scenarios. What I think is useful for me is the concept of scenario generation and the application of federated learning. There are some references that I think are interesting:

Technical description:

Generative adversarial networks: a GAN contains a generator and a discriminator; the generator produces samples, and the discriminator tries to judge whether its input comes from the historical data or from the generator.


then the output of the discriminator network is

\[ \left\{\begin{array}{l} p_{\text {real }}=D(\boldsymbol{x}) \\ p_{\text {fake }}=D(G(\boldsymbol{z})) \end{array}\right. \]

and the loss functions of the generator and the discriminator are

\[ \begin{gathered} L_G=\mathbb{E}_{\boldsymbol{z} \sim P_Z}[\log (1-D(G(\boldsymbol{z})))] \\ L_D=-\mathbb{E}_{\boldsymbol{x} \sim P_d}[\log D(\boldsymbol{x})]-\mathbb{E}_{\boldsymbol{z} \sim P_Z}[\log (1-D(G(\boldsymbol{z})))] \end{gathered} \]

where \(P_Z\) is a known distribution that is easy to sample. Then the mini-max game model with value function \(V_{\mathrm{GANs}}(G, D)\) is given by

\[ \begin{aligned} \min _G \max _D V_{\mathrm{GANs}}(G, D)= & \mathbb{E}_{\boldsymbol{x} \sim P_d}[\log D(\boldsymbol{x})] \\ & +\mathbb{E}_{\boldsymbol{z} \sim P_Z}[\log (1-D(G(\boldsymbol{z})))] \end{aligned} \]

Federated Learning: Suppose there are \(N\) clients, i.e., participating edge devices \(\left\{\mathcal{C}_1, \mathcal{C}_2, \ldots, \mathcal{C}_N\right\}\), each holding its own dataset \(\left\{\mathcal{D}_1, \mathcal{D}_2, \ldots, \mathcal{D}_N\right\}\); the clients collaboratively train a shared model without exchanging their raw data.

\(\delta\)-accuracy loss: assuming that \(\mathcal{V}_{\mathrm{SUM}}\) and \(\mathcal{V}_{\mathrm{FED}}\) are the performance metrics of the centralized model \(\mathcal{M}_{\mathrm{SUM}}\) and the federated model \(\mathcal{M}_{\mathrm{FED}}\), the federated model is said to have \(\delta\)-accuracy loss if \(\left|\mathcal{V}_{\mathrm{SUM}}-\mathcal{V}_{\mathrm{FED}}\right|<\delta\).


Global LSGANs Model: for a fixed generator \(G\), the optimal discriminator of the vanilla GAN is

\[ D_{G, \mathrm{GANs}}^*(\boldsymbol{x})=\frac{P_d(\boldsymbol{x})}{P_d(\boldsymbol{x})+P_g(\boldsymbol{x})} \]

Substituting the optimal discriminator back into the value function gives

\[ C_{\mathrm{GANs}}(G)=V\left(D_{G, \mathrm{GANs}}^*, G\right)=2 \mathrm{JSD}\left(P_d \| P_g\right)-\log (4) \]
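This identity can be checked numerically for discrete distributions; the toy pmfs below are my own example, not from the paper.

```python
import math

# Check V(D*, G) = 2*JSD(P_d || P_g) - log(4) on toy 3-outcome pmfs.
P_d = [0.5, 0.3, 0.2]
P_g = [0.2, 0.5, 0.3]

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# value function evaluated at the optimal discriminator D*(x) = P_d/(P_d+P_g)
d_star = [pd / (pd + pg) for pd, pg in zip(P_d, P_g)]
value = (sum(pd * math.log(ds) for pd, ds in zip(P_d, d_star))
         + sum(pg * math.log(1 - ds) for pg, ds in zip(P_g, d_star)))

print(abs(value - (2 * jsd(P_d, P_g) - math.log(4))))  # ~0: identity holds
```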

The vanilla GAN has some drawbacks (e.g., vanishing gradients from the sigmoid cross-entropy loss), so the least squares GAN (LSGAN) is proposed. Using the \(a\)-\(b\) coding scheme and the least squares loss function, the objective functions of LSGAN are

\[ \begin{gathered} \min _D V_{\mathrm{LSGANs}}(D)=\frac{1}{2} \mathbb{E}_{\boldsymbol{x} \sim P_d}\left[(D(\boldsymbol{x})-b)^2\right]+ \\ \frac{1}{2} \mathbb{E}_{\boldsymbol{z} \sim P_Z}\left[(D(G(\boldsymbol{z}))-a)^2\right] \\ \min _G V_{\mathrm{LSGANs}}(G)=\frac{1}{2} \mathbb{E}_{\boldsymbol{z} \sim P_Z}\left[(D(G(\boldsymbol{z}))-c)^2\right] \end{gathered} \]

Then, for a fixed generator \(G\), the optimal discriminator \(D\) is

\[ D_{G, \text { LSGANs }}^*(\boldsymbol{x})=\frac{b P_d(\boldsymbol{x})+a P_g(\boldsymbol{x})}{P_d(\boldsymbol{x})+P_g(\boldsymbol{x})} \]

If we choose \(b-c=1\) and \(b-a=2\), then we could get

\[ 2 C_{\mathrm{LSGANs}}(G)=\chi_{\text {Pearson }}^2\left(P_d+P_g \| 2 P_g\right) \]

where \(\chi_{\text {Pearson }}^2\) is the Pearson \(\chi^2\) divergence. If \(b-c=1\) and \(b-a=2\) are satisfied, (8) is equivalent to minimizing the Pearson \(\chi^2\) divergence.
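A quick numerical check of this equivalence with toy discrete pmfs (my example), using the encoding \(a=-1\), \(b=1\), \(c=0\), which satisfies \(b-c=1\) and \(b-a=2\).

```python
# Check 2*C_LSGANs(G) = chi2_Pearson(P_d + P_g || 2*P_g) on toy pmfs.
P_d = [0.5, 0.3, 0.2]
P_g = [0.2, 0.5, 0.3]
a, b, c = -1.0, 1.0, 0.0

# optimal discriminator D*(x) = (b*P_d + a*P_g) / (P_d + P_g)
d_star = [(b * pd + a * pg) / (pd + pg) for pd, pg in zip(P_d, P_g)]

# 2*C(G) = E_{x~P_d}[(D*(x) - c)^2] + E_{x~P_g}[(D*(x) - c)^2]
two_c = sum((pd + pg) * (ds - c) ** 2
            for pd, pg, ds in zip(P_d, P_g, d_star))

# Pearson chi^2 divergence of (P_d + P_g) from 2*P_g
chi2 = sum((2 * pg - (pd + pg)) ** 2 / (pd + pg)
           for pd, pg in zip(P_d, P_g))
print(abs(two_c - chi2))  # ~0: the equivalence holds
```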


Network configuration: ReLU and LeakyReLU activation functions are used.


Then we consider the FederatedAveraging (FedAvg) algorithm, proposed in the paper Communication-Efficient Learning of Deep Networks from Decentralized Data.

The major difference between federated optimization and classical distributed optimization is that the client data are non-IID and unbalanced, the clients are massively distributed, and communication is limited. The global objective is

\[ \min _{w \in \mathbb{R}^d} f(w) \quad \text { where } \quad f(w) \stackrel{\text { def }}{=} \frac{1}{n} \sum_{i=1}^n f_i(w) \]

For a machine learning problem, we typically take

\[ f_i(w)= \ell\left(x_i, y_i ; w\right) \]

We assume that there are \(K\) clients, with \(\mathcal{P}_k\) the set of indices of data points on client \(k\) and \(n_k = \lvert \mathcal{P}_k \rvert\); then

\[ f(w)=\sum_{k=1}^K \frac{n_k}{n} F_k(w) \quad \text { where } \quad F_k(w)=\frac{1}{n_k} \sum_{i \in \mathcal{P}_k} f_i(w) \text {. } \]

Three key parameters: \(C\), the fraction of clients that perform computation in each round; \(E\), the number of training passes each client makes over its local dataset in each round; and \(B\), the local minibatch size used for the client updates. We write \(B=\infty\) to indicate that the full local dataset is treated as a single minibatch.

```
Algorithm 1  FederatedAveraging. The K clients are indexed by k; B is the
local minibatch size, E is the number of local epochs, and η is the
learning rate.

Server executes:
    initialize w_0
    for each round t = 1, 2, ... do
        m ← max(C · K, 1)
        S_t ← (random set of m clients)
        for each client k ∈ S_t in parallel do
            w_{t+1}^k ← ClientUpdate(k, w_t)
        w_{t+1} ← Σ_{k=1}^{K} (n_k / n) · w_{t+1}^k

ClientUpdate(k, w):   // run on client k
    B ← (split P_k into batches of size B)
    for each local epoch i from 1 to E do
        for each batch b ∈ B do
            w ← w − η ∇ℓ(w; b)
    return w to server
```
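A minimal runnable sketch of FedAvg for a one-parameter least-squares model \(y = w x\); the client data, the hyperparameters (\(C=1\), \(E=5\), \(B=2\), \(\eta=0.01\)), and the target \(w=2\) are toy assumptions of mine, not from either paper.

```python
import random

# Toy FedAvg: 4 clients, each with 5 points from the line y = 2x.
random.seed(0)
clients = [[(x, 2.0 * x) for x in range(1, 6)] for _ in range(4)]
n = sum(len(d) for d in clients)

def client_update(w, data, E=5, B=2, eta=0.01):
    """E local epochs of minibatch SGD on the squared loss (w*x - y)^2."""
    for _ in range(E):
        random.shuffle(data)
        for i in range(0, len(data), B):
            batch = data[i:i + B]
            grad = sum(2 * x * (w * x - y) for x, y in batch) / len(batch)
            w -= eta * grad
    return w

w = 0.0
for _ in range(20):                          # communication rounds, C = 1
    local = [client_update(w, d) for d in clients]
    w = sum(len(d) / n * wk for d, wk in zip(clients, local))  # weighted avg
print(w)  # converges toward the true slope 2.0
```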

Then the federated LSGAN training algorithm follows directly by combining FedAvg with the LSGAN updates (there is no real contribution here, except the GAN optimization part).

Correlation Analysis:

\[ R(\tau)=\frac{\mathbb{E}\left[\left(S_t-\mu\right)\left(S_{t+\tau}-\mu\right)\right]}{\sigma^2} \]

where \(S\) is a random time series, \(\mu\) and \(\sigma^2\) denote the mean and variance of \(S\), respectively, and \(\tau\) is the time lag.
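A sample-based sketch of \(R(\tau)\), assuming the common biased estimator that uses the full-series mean and variance (the alternating toy series is my example).

```python
# Hedged sketch of the lagged autocorrelation R(tau).

def autocorr(s, tau):
    n = len(s)
    mu = sum(s) / n
    var = sum((x - mu) ** 2 for x in s) / n
    cov = sum((s[t] - mu) * (s[t + tau] - mu) for t in range(n - tau)) / n
    return cov / var

s = [1.0, -1.0] * 50           # alternating series, period 2
print(autocorr(s, 0))          # 1.0 by construction
print(autocorr(s, 1))          # -0.99: strong period-2 anti-correlation
```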

We use the continuous ranked probability score (CRPS) which measures the dissimilarity of the cumulative distributions between generated scenarios and historical observations.

The score at lead time \(l\) is defined as

\[ \operatorname{CRPS}_l=\frac{1}{M} \sum_{t=1}^M \int_0^1\left(\widehat{F}_{t+l \mid t}(\xi)-\mathbf{1}\left(\xi \geq \xi_{t+l}\right)\right)^2 d \xi \]

where \(M\) is the total number of scenarios, \(\widehat{F}_{t+l \mid t}(\xi)\) denotes the cumulative distribution function of the normalized scenarios, and \(\mathbf{1}\left(\xi \geq \xi_{t+l}\right)\) is the indicator function comparing scenarios with the observation.
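A sketch of the CRPS for one lead time with scalar scenarios in \([0,1]\): the closed "energy" form \(\mathbb{E}|X-y|-\tfrac{1}{2}\mathbb{E}|X-X'|\) for the empirical CDF is compared against a direct grid approximation of the integral above. The samples and observation are toy values of mine.

```python
# Hedged sketch of the CRPS at a single lead time.

def crps_closed(samples, obs):
    """Energy form of the CRPS for the empirical CDF of the samples."""
    m = len(samples)
    term1 = sum(abs(x - obs) for x in samples) / m
    term2 = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return term1 - term2

def crps_grid(samples, obs, steps=100000):
    """Direct midpoint-rule approximation of the squared-CDF integral."""
    m = len(samples)
    total = 0.0
    for i in range(steps):
        xi = (i + 0.5) / steps                        # grid over [0, 1]
        f = sum(1 for x in samples if x <= xi) / m    # empirical CDF
        h = 1.0 if xi >= obs else 0.0                 # step at the observation
        total += (f - h) ** 2
    return total / steps

samples = [0.1, 0.4, 0.5, 0.9]
obs = 0.45
print(crps_closed(samples, obs))   # 0.06875
print(crps_grid(samples, obs))     # agrees closely with the closed form
```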

Fréchet Inception Distance (FID):

\[ \operatorname{FID}\left(P_d, P_g\right)=\left\|\mu_d-\mu_g\right\|_2^2+\operatorname{Tr}\left(\Sigma_d+\Sigma_g-2\left(\Sigma_d \Sigma_g\right)^{\frac{1}{2}}\right) \]

where \(\mu_d\) and \(\mu_g\) are the empirical means, and \(\Sigma_d\) and \(\Sigma_g\) the empirical covariances, of the real and generated data, respectively.
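As an illustration, an FID sketch restricted to diagonal covariance matrices, where the matrix square root reduces to an elementwise square root; the means and variances below are toy values of mine.

```python
import math

# Hedged FID sketch for the diagonal-covariance special case.

def fid_diag(mu_d, mu_g, var_d, var_g):
    """Squared mean distance plus the diagonal-case trace term."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_d, mu_g))
    cov_term = sum(vd + vg - 2 * math.sqrt(vd * vg)
                   for vd, vg in zip(var_d, var_g))
    return mean_term + cov_term

# mean term: (0-1)^2 = 1; cov term: (1+1-2) + (1+4-4) = 1  ->  FID = 2
print(fid_diag([0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [1.0, 4.0]))  # 2.0
```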

Kernel Maximum Mean Discrepancy (MMD): measures the difference between \(P_d\) and \(P_g\) for some fixed kernel function \(k\), which is defined as

\[ \operatorname{MMD}^2\left(P_d, P_g\right)=\mathbb{E}_{\substack{x, x^{\prime} \sim P_d \\ y, y^{\prime} \sim P_g}}\left[k\left(x, x^{\prime}\right)-2 k(x, y)+k\left(y, y^{\prime}\right)\right] \]
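A biased sample estimate of \(\mathrm{MMD}^2\) with an RBF kernel; the kernel bandwidth and toy samples are my assumptions. Identical samples give exactly zero, and separated samples give a positive value.

```python
import math

# Hedged sketch of the biased MMD^2 estimator with an RBF kernel.

def rbf(x, y, gamma=1.0):
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=1.0):
    m, n = len(xs), len(ys)
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (m * m)
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (m * n)
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (n * n)
    return kxx - 2 * kxy + kyy

xs = [0.0, 0.1, 0.2]
print(mmd2(xs, xs))                 # 0.0 for identical samples
print(mmd2(xs, [1.0, 1.1, 1.2]))   # > 0 for shifted samples
```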

The 1-Nearest Neighbor classifier: used as a two-sample test; a 1-NN classifier trained to separate real from generated samples should attain an accuracy close to 50% when the two distributions match.

Energy Score (ES)

\[ \mathrm{ES}=\frac{1}{M} \sum_{i=1}^M\left\|\varsigma-\xi_i\right\|-\frac{1}{2 M^2} \sum_{i=1}^M \sum_{j=1}^M\left\|\xi_i-\xi_j\right\| \]

where \(\varsigma\) is the real renewable power output, \(\xi_i\) is the \(i\)-th generated time series scenario, and \(M\) denotes the number of scenarios.
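A direct sketch of the ES for scalar scenarios (toy values of mine): a zero-spread ensemble located exactly at the observation scores 0, and ensembles that deviate from it score higher.

```python
# Hedged sketch of the Energy Score for scalar scenarios.

def energy_score(obs, scenarios):
    m = len(scenarios)
    term1 = sum(abs(obs - x) for x in scenarios) / m
    term2 = sum(abs(a - b) for a in scenarios for b in scenarios) / (2 * m * m)
    return term1 - term2

print(energy_score(0.5, [0.5, 0.5, 0.5]))   # 0.0: perfect ensemble
print(energy_score(0.5, [0.2, 0.5, 0.8]))   # positive: spread ensemble
```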

The Pearson correlation coefficient \(\rho\) of two time series \(S_i\) and \(S_j\) of length \(n\) is

\[ \rho\left(S_i, S_j\right)=\frac{\sum_{t=1}^n\left(S_{i, t}-\bar{S}_i\right)\left(S_{j, t}-\bar{S}_j\right)}{\sqrt{\sum_{t=1}^n\left(S_{i, t}-\bar{S}_i\right)^2} \sqrt{\sum_{t=1}^n\left(S_{j, t}-\bar{S}_j\right)^2}} \]
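And a direct sketch of \(\rho\) over the common time index (toy series of mine):

```python
import math

# Hedged sketch of the Pearson correlation between two equal-length series.

def pearson(si, sj):
    n = len(si)
    mi, mj = sum(si) / n, sum(sj) / n
    num = sum((a - mi) * (b - mj) for a, b in zip(si, sj))
    den = (math.sqrt(sum((a - mi) ** 2 for a in si))
           * math.sqrt(sum((b - mj) ** 2 for b in sj)))
    return num / den

a = [1.0, 2.0, 3.0, 4.0]
print(pearson(a, [2.0, 4.0, 6.0, 8.0]))   # 1.0: perfect linear relation
print(pearson(a, [4.0, 3.0, 2.0, 1.0]))   # -1.0: perfect inverse relation
```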