We start by defining some basic notation:
\(G\): finite graph on vertices \(V\)
\(q \in \mathbb{N}\): we are interested in \(q\)-colourings of the vertices of \(G\)
\(m(\chi)\): number of monochromatic edges induced by a colouring \(\chi\)
Distribution on colourings given by \(p(\chi) \propto \exp(\beta \cdot m(\chi))\)
\(\beta \in \mathbb{R}\): parameter, inverse temperature
Notice that \(\beta < 0\) gives the antiferromagnetic case. Here we talk mostly about \(\beta > 0\), the ferromagnetic case.
This model has quite a few applications:
Modelling: Social networks, physics, chemistry, etc
Markov Random field: Probabilistic Inference
Connection to UGC [^coulson_et_al]
and more.
We know \(p(\chi) \propto \exp(\beta \cdot m(\chi))\).
For \(\beta = 0\) this is a uniformly random \(q\)-colouring of \(V\).
For \(\beta = -\infty\) we get a uniformly random proper colouring of \(G\).
Our task: given \(G\) and \(\beta\), efficiently sample a colouring from this distribution.
$$p(\chi) = \frac{\exp(\beta m(\chi))}{\sum_{\chi}\exp(\beta m(\chi))}$$
The normalizing factor here is:
$$\text{Normalizing factor} = \sum_{\chi}\exp(\beta m(\chi))$$
Now we can also say,
$$\sum_{\chi}\exp(\beta m(\chi)) =: Z_G(q,\beta)$$
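For intuition, the partition function can be brute-forced on very small graphs directly from the definitions above. This is only an illustrative sketch (the helper name `potts_Z` and the triangle example are mine), not an efficient algorithm:

```python
import itertools, math

def potts_Z(edges, n, q, beta):
    """Brute-force Z_G(q, beta) = sum over all colourings chi of exp(beta * m(chi))."""
    Z = 0.0
    for chi in itertools.product(range(q), repeat=n):
        m = sum(1 for (u, v) in edges if chi[u] == chi[v])  # monochromatic edges
        Z += math.exp(beta * m)
    return Z

# Sanity check on a triangle: at beta = 0 every colouring has weight 1, so Z = q^n.
triangle = [(0, 1), (1, 2), (0, 2)]
print(potts_Z(triangle, n=3, q=3, beta=0.0))  # 27.0
```

This runs in time \(q^n\), which is exactly why the approximate-counting machinery below is needed.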
The partition function of the model/distribution is central from this point of view. Our problem: given \(G\) and \(\beta\), efficiently sample a colouring from this distribution.
We note two facts:
To sample, it is enough to compute \(Z_G(q,\beta)\)
Computing \(Z_G(q,\beta)\) exactly is #P-hard
We now modify the problem as: Given \(G\) and \(\beta\), efficiently sample \textbf{approximately} a colouring from this distribution.
An \(\epsilon\)-approximate sampler outputs a sample from a law \(q\) such that \(\|p - q\|_{TVD} \leq \epsilon\), where
$$\|p - q\|_{TVD} := \frac{1}{2} \sum_{\chi} |p(\chi) - q(\chi)|$$
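The total variation distance above is easy to compute for explicit distributions over a finite support. A small sketch (function name and examples are mine):

```python
def tvd(p, q):
    """Total variation distance between two distributions given as dicts over a finite support."""
    support = set(p) | set(q)
    return 0.5 * sum(abs(p.get(x, 0.0) - q.get(x, 0.0)) for x in support)

# A fair coin vs. a coin that always lands on "a":
print(tvd({"a": 0.5, "b": 0.5}, {"a": 1.0}))  # 0.5
```

Of course, for the Potts distribution the support has \(q^n\) elements, so one never computes this quantity explicitly; it is only the yardstick for approximate sampling.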
We modify our original problem template to now be: Given \(G\) and \(\beta\), efficiently sample a colouring \(\epsilon\)-\textbf{approximately} from this distribution.
A Fully Polynomial Almost Uniform Sampler (FPAUS) samples \(\epsilon\)-approximately in \(poly(G,\frac{1}{\epsilon})\) time.
A Fully Polynomial Time Approximation Scheme (FPTAS) instead gives a \(1 \pm \epsilon\)-factor approximation of \(Z_G(q,\beta)\) in \(poly(G,\frac{1}{\epsilon})\) time.
In fact, \(FPTAS \iff FPAUS\).
The Antiferromagnetic Potts model:
$$p(\chi) \propto \exp(\beta \cdot m(\chi))$$
where \(\beta < 0\)
Given \(G\) and \(\beta < 0\), we want to be able to give an FPAUS for this distribution. It is then equivalent to instead work on the problem: given \(G\) and \(\beta < 0\), give an FPTAS for its partition function \(Z_G(q, \beta)\).
From some previous work, we know that there exists a \(\beta_c\) such that:
for \(\beta > \beta_c\), an FPTAS exists
for \(\beta < \beta_c\), there is no FPTAS unless \(NP = RP\)
More precisely, the problem is #BIS-hard: at least as hard as an FPTAS for the number of independent sets in bipartite graphs. For non-bipartite graphs the problem becomes NP-hard.
For now, let's consider the problem: given a bipartite graph \(G\), design an FPTAS for the number of independent sets in \(G\). This accurately captures the difficulty of counting: the number of proper \(q\)-colourings of a bipartite graph for \(q \geq 3\), the number of stable matchings, and the number of antichains in posets.
For our purposes we assume that \(G\) is always a \(d\)-regular graph on \(n\) vertices. For a set \(S \subseteq V\), we define its edge boundary as:
$$\nabla(S) := \#\{uv \in E(G) \mid u \in S, v \notin S\}$$
Now, \(G\) is an \(\eta\)-expander if for every \(S \subseteq V\) of size at most \(n/2\), we have \(|\nabla(S)| \geq \eta|S|\). For example, take the discrete cube \(Q_d\) with vertex set \(\{0,1\}^d\), where \(uv\) is an edge if \(u\) and \(v\) differ in exactly one coordinate.
Using a simplification of Harper's theorem, \(Q_d\) is a \(1\)-expander [^frankl1981short].
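The 1-expansion of the discrete cube can be verified exhaustively for small \(d\). The following sketch (using the edge-boundary definition above; helper names are mine) checks \(Q_3\):

```python
import itertools

def hypercube_edges(d):
    """Vertices and edges of the discrete cube Q_d."""
    V = list(itertools.product((0, 1), repeat=d))
    E = [(u, v) for u in V for v in V
         if u < v and sum(a != b for a, b in zip(u, v)) == 1]
    return V, E

def edge_boundary(S, E):
    """Number of edges with exactly one endpoint in S."""
    return sum(1 for (u, v) in E if (u in S) != (v in S))

# Exhaustively verify 1-expansion of Q_3: |∇(S)| >= |S| for all S with |S| <= n/2.
V, E = hypercube_edges(3)
ok = all(edge_boundary(set(S), E) >= len(S)
         for k in range(1, len(V) // 2 + 1)
         for S in itertools.combinations(V, k))
print(ok)  # True
```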
Theorem: For each \(\epsilon > 0\) there are \(d = d(\epsilon)\) and \(q = q(\epsilon)\) such that there is an FPTAS for \(Z_G(q, \beta)\) where \(G\) is a \(d\)-regular \(2\)-expander, provided the following conditions hold:
\(q=poly(d)\)
\(\beta \notin (2 \pm \epsilon)\frac{\ln q}{d}\)
The main result shown is:
Theorem: For each \(\epsilon > 0\) and \(d\) large enough, there is an FPTAS for \(Z_G(q, \beta)\) for the class of \(d\)-regular triangle-free \(1\)-expander graphs \(G\), provided the following conditions hold:
\(q \geq poly(d)\)
\(\beta \notin (2 \pm \epsilon)\frac{\ln q}{d}\)
This was previously known for:
Stronger expansion and \(q = d^{\Omega(d)}\)
Higher temperature and \(q = d^{\Omega(d)}\)
Something to note here is that \(q \geq poly(d)\) should not be a necessary condition.
Also, the case \(\beta \leq (1-\epsilon)\beta_0\) requires neither expansion nor \(q \geq poly(d)\).
We first write down the order-disorder threshold of the ferromagnetic Potts model:
$$\beta_0 := \ln\left(\frac{q-2}{(q-1)^{1-2/d} - 1}\right)$$
$$\beta_0 = 2 \frac{\ln q}{d} \left(1+O \left( \frac{1}{q} \right)\right)$$
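A quick numerical check of this asymptotic, using the explicit formula for \(\beta_0\) above (the function name is mine):

```python
import math

def beta0(q, d):
    # Order-disorder threshold of the ferromagnetic Potts model, as defined above.
    return math.log((q - 2) / ((q - 1) ** (1 - 2 / d) - 1))

q, d = 10**6, 20
# The two values agree up to a 1 + O(1/q) factor:
print(beta0(q, d), 2 * math.log(q) / d)
```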
We want to be able to know more about how the Potts distribution looks for \(\beta < (1-\epsilon)\beta_0\) and for \(\beta > (1+\epsilon)\beta_0\).
Another result we have is:
Theorem: For each \(\epsilon>0\), let \(d\) be large enough, \(q \geq poly(d)\), and \(G\) a \(d\)-regular \(2\)-expander graph on \(n\) vertices. Then:
For \(\beta < (1-\epsilon) \beta_0\), every colour class has size \(\frac{n}{q}(1 \pm o(1))\) with high probability
For \(\beta > (1+\epsilon) \beta_0\), some colour class has size \(n - o(n)\) with high probability
The strategy we have, to prove the theorem for \(\beta < (1-\epsilon) \beta_0\):
Pass to the Random Cluster Model
Distribution on subsets of edges: \(p(A) \propto q^{k(A)} (e^{\beta}-1)^{|A|}\), where \(k(A)\) is the number of connected components of \((V,A)\)
\(Z_G^{RC}(q, \beta) = Z_G^{Potts}(q, \beta)\)
Sampling algorithm: Sample from random cluster model, give each connected component a uniform color
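The colouring step of this sampling algorithm, giving each connected component one uniform colour, can be sketched with a union-find pass (a minimal illustration of that one step, not the full random-cluster sampler; names are mine):

```python
import random

def colour_from_cluster(n, A, q, rng=random):
    """Given an edge subset A from the random cluster model, assign every
    connected component of (V, A) one uniformly random colour from [q]."""
    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in A:
        parent[find(u)] = find(v)

    colour_of_root = {}
    return [colour_of_root.setdefault(find(v), rng.randrange(q)) for v in range(n)]

# On a spanning connected edge set, everything lies in one component and gets one colour.
chi = colour_from_cluster(4, [(0, 1), (1, 2), (2, 3)], q=5)
print(len(set(chi)))  # 1
```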
Standard polymer methods + careful enumeration
The motivating idea is to visualize the state at low temperature (large \(\beta\)) as ground state + defects.
Typical Colouring = Ground State + Defects
Polymer methods are pretty useful in such cases. These were first proposed in [^helmuth2019] and originated in statistical physics. We take \(G\) to be our defect graph and each node in this represents a defect.
In the polymer method, two polymers \(X\) and \(Y\) are incompatible, written \(X \sim_G Y\), when they are adjacent in the defect graph.
The idea is that \(Z_G(q,\beta) \approx Z_{red} + Z_{blue} + \dots\), where \(Z_{red} \approx e^{\beta nd/2}\).
\(Z_{red} \cdot e^{-\beta nd/2} = \sum_{I} \prod_{\gamma \in I} w_{\gamma}\), where the sum is over compatible sets \(I\) of polymers and \(w_{\gamma}\) is the weight of polymer \(\gamma\).
We now move towards the cluster expansion: the multivariate (in the \(w_{\gamma}\)) Taylor expansion of:
$$\ln\left(\sum_{I} \prod_{\gamma \in I} w_{\gamma}\right)$$
This is an infinite sum, so convergence is not guaranteed; however, convergence can be established by verifying the Kotecký–Preiss criterion.
We also want to answer: how many connected subsets of a given edge boundary are there in an \(\eta\)-expander?
A heuristic: count the number of such subsets that contain a given vertex \(u\). A typical connected subgraph of size \(a\) is treelike, i.e., has edge boundary about \(a \cdot d\).
Working backwards, a typical connected subgraph with edge boundary \(b\) has \(O(b/d)\) vertices. The number of such subgraphs is at most the number of connected subgraphs of size \(O(b/d)\) containing \(u\), which is at most the number of trees rooted at \(u\) with \(O(b/d)\) vertices and maximum degree at most \(d\), i.e. \(d^{O(b/d)}\). Thus,
Theorem: At most \(d^{O((1+1/\eta)b/d)}\) connected subsets of an \(\eta\)-expander containing \(u\) have edge boundary of size at most \(b\).
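As a toy check of the flavour of this count, the quantity can be brute-forced on the small cube \(Q_3\) (here \(u\) is an arbitrary fixed vertex and I take \(b = d = 3\); everything below is my own illustration):

```python
import itertools

def q3():
    """Vertices and edges of the discrete cube Q_3."""
    V = list(itertools.product((0, 1), repeat=3))
    E = [(u, v) for u in V for v in V
         if u < v and sum(a != b for a, b in zip(u, v)) == 1]
    return V, E

def connected(S, E):
    """BFS/DFS check that the induced subgraph on S is connected."""
    S = set(S)
    start = next(iter(S))
    seen, stack = {start}, [start]
    while stack:
        x = stack.pop()
        for u, v in E:
            for a, b in ((u, v), (v, u)):
                if a == x and b in S and b not in seen:
                    seen.add(b)
                    stack.append(b)
    return seen == S

V, E = q3()
u, b = V[0], 3
count = sum(
    1
    for k in range(1, len(V) + 1)
    for S in itertools.combinations(V, k)
    if u in S and connected(S, E)
    and sum((x in S) != (y in S) for x, y in E) <= b
)
print(count)  # 9: the singleton {u}, the full cube, and the 7 complements of single vertices != u
```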
Another question to ask is how many \(q\)-colourings of an \(\eta\)-expander induce at most \(k\) non-monochromatic edges?
The easiest way to create \(k\) non-monochromatic edges is to colour all but \(k/d\) randomly chosen vertices with the same colour. When \(k\) is small, these vertices likely form an independent set; we then colour these \(k/d\) vertices arbitrarily. There are
$${n \choose k/d} q^{k/d+1}$$
ways.
Theorem: For \(\eta\)-expanders and \(q \geq poly(d)\) there are at most \(n^4 q^{O(k/d)}\) such colourings.
We also know the maximum value of \(Z_G(q,\beta)\) over all graphs \(G\) with \(n\) vertices, \(m\) edges, and maximum degree \(d\): it is always attained when \(G\) is a disjoint union of copies of \(K_{d+1}\) and \(K_1\).
[^carlson2022algorithms]: Carlson, Charlie, et al. "Algorithms for the ferromagnetic Potts model on expanders." 2022 IEEE 63rd Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 2022.
[^coulson_et_al]: Coulson, Matthew, et al. "Statistical physics approaches to Unique Games." arXiv preprint arXiv:1911.01504 (2019).
[^helmuth2019]: Helmuth, Tyler, Will Perkins, and Guus Regts. "Algorithmic Pirogov–Sinai theory." Proceedings of the 51st Annual ACM SIGACT Symposium on Theory of Computing. 2019.
[^frankl1981short]: Frankl, Peter, and Zoltán Füredi. "A short proof for a theorem of Harper about Hamming-spheres." Discrete Mathematics 34.3 (1981): 311–313.
I will first review some standard notation:
\(Z\): domain set
\(D_Z\): distribution over \(Z\)
\(S\): i.i.d. sample from \(D_Z\)
\(H\): class of models/hypotheses
\(L(D_Z, h) \in \mathbb{R}\): loss/error function
\(OPT = \inf_{h \in H} L(D_Z, h)\): best achievable
\(A_{Z,H}: Z^* \to H\): learner
Our goal for the task of density estimation: for every \(D_Z\), we want \(A_{Z,H}(S)\) to be comparable to \(OPT\) with high probability.
We take the example of density estimation; in this case, \(L(D_Z, h) = d_{TV}(D_Z, h)\). Now, \(A_{Z,H}\) probably approximately correctly (PAC) learns \(H\) with \(m(\epsilon, \delta)\) samples if for all \(D_Z\) and all \(\epsilon, \delta \in (0,1)\): if \(S \sim D_Z^{m(\epsilon, \delta)}\) then
$$\underset{S}{\mathrm{Pr}} [L(D_Z, A_{Z,H}(S)) > \epsilon + C \cdot OPT] < \delta$$
Now if we take the example of \(C=2\), let \(H\) be the set of all Gaussians in \(\mathbb{R}^d\) then:
$$m(\epsilon, \delta) = O \left( \frac{d^2 + \log 1/\delta}{\epsilon^2} \right)$$
We will now modify the above equation. Now, \(A_{Z,H}\) probably approximately correct learns \(H\) with \(m(\epsilon)\) samples if for all \(D_Z\) and for all \(\epsilon \in ( 0,1 )\). Now if \(S \sim D_Z^{m(\epsilon)}\) then:
$$\underset{S}{\mathrm{Pr}} [L(D_Z, A_{Z,H}(S)) > \epsilon + C \cdot OPT] < 0.01$$
For the example of \(C=2\), let \(H\) be the set of all Gaussians in \(\mathbb{R}^d\) then:
$$m(\epsilon) = O \left( \frac{d^2}{\epsilon^2} \right)$$
For the example of binary classification, we have \(Z = X \times \{0, 1\}\) and \(h\) is some model which maps \(h: X \to \{0, 1\}\).
We also have \(l(h,x,y) = \mathbb{1}\{h(x) \neq y\}\), and then \(L\) is \(L(D_Z,h) = \mathbb{E}_{(x,y) \sim D_Z}\, l(h,x,y)\).
Now, \(A_{Z,H}\) probably approximately correct learns \(H\) with \(m(\epsilon)\) samples if for all \(D_Z\) and for all \(\epsilon \in ( 0,1 )\). Now if \(S \sim D_Z^{m(\epsilon)}\) then:
$$\underset{S}{\mathrm{Pr}} [L(D_Z, A_{Z,H}(S)) > \epsilon + C \cdot OPT] < 0.01$$
Now if \(H\) is the set of all halfspaces in \(\mathbb{R}^d\), then:
$$m(\epsilon) = O \left( \frac{d}{\epsilon^2} \right)$$
For the example of binary classification with adversarial perturbations, we again have \(Z = X \times \{0, 1\}\) and \(h: X \to \{0, 1\}\).
We also have \(l^U(h,x,y)\), the adversarial-perturbation loss, and then \(L^U(D_Z,h) = \mathbb{E}_{(x,y) \sim D_Z}\, l^U(h,x,y)\).
Now, \(A_{Z,H}\) probably approximately correct learns \(H\) with \(m(\epsilon)\) samples if for all \(D_Z\) and for all \(\epsilon \in ( 0,1 )\). Now if \(S \sim D_Z^{m(\epsilon)}\) then:
$$\underset{S}{\mathrm{Pr}} [L^U(D_Z, A_{Z,H}(S)) > \epsilon + OPT] < 0.01$$
Now, consider two hypothesis classes \(H_1\) and \(H_2\):
Here \(H_2\) is richer, which can make it contain better models but also be harder to learn. We can characterize the sample complexity of learning \(H\) for binary classification, for binary classification with adversarial perturbations, or for density estimation.
In the case of binary classification, the VC dimension quantifies complexity:
$$m ( \epsilon ) = \Theta \left( \frac{VC(H)}{\epsilon^2} \right)$$
The upper bound here is achieved using simple ERM (empirical risk minimization):
$$L(S,h) = \frac{1}{|S|} \sum_{(x,y)\in S} l(h,x,y)$$
$$\hat{h} = \arg\min_{h \in H} L(S,h)$$
And then by uniform convergence, with probability \(> 0.99\):
$$\sup_{h \in H} |L(S,h) - L(D_Z,h)| = O \left( \sqrt{\frac{VC(H)}{|S|}} \right)$$
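As a concrete instance of ERM, here is a minimal sketch for 1-D threshold classifiers \(h_t(x) = \mathbb{1}\{x \geq t\}\), a class of VC dimension 1 (all names below are mine):

```python
def erm_threshold(S):
    """ERM over 1-D thresholds: pick the threshold (from the sample points)
    with the smallest empirical 0/1 loss on S."""
    candidates = [float("-inf")] + [x for x, _ in S]

    def emp_loss(t):
        return sum(1 for x, y in S if (x >= t) != y) / len(S)

    return min(candidates, key=emp_loss)

S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
t = erm_threshold(S)
print(sum((x >= t) != y for x, y in S))  # 0 empirical errors on separable data
```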
We now introduce sample compression as an alternative.
The idea is to try and answer how should we go about compressing a given training set? In classic information theory, we would compress it into a few bits. In the case of sample compression, we want to try to compress it into a few samples.
If we take the simple example of linear classification, the number of required bits is unbounded (it depends on the sample).
It has already been shown [^littlestone1986relating] that compressibility \(\implies\) learnability: a compression scheme of size \(k\) gives
$$m(\epsilon)=\tilde{O} \left( \frac{k}{\epsilon^2} \right)$$
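To make the idea concrete: for 1-D thresholds on realizable data, a single retained sample suffices to reconstruct a consistent hypothesis. A toy sketch (my own minimal encoder/decoder, not the general construction):

```python
def compress(S):
    """Encode a realizable training set for 1-D thresholds by keeping only
    the leftmost positive example: a size-1 compression set."""
    positives = [x for x, y in S if y == 1]
    return [min(positives)] if positives else []

def decode(kept):
    """Rebuild a threshold classifier from the compression set."""
    t = kept[0] if kept else float("inf")
    return lambda x: int(x >= t)

S = [(0.1, 0), (0.4, 0), (0.6, 1), (0.9, 1)]
h = decode(compress(S))
print(all(h(x) == y for x, y in S))  # True: one retained sample recovers every label
```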
It has also been shown [^moran2016sample] that compressibility \(\impliedby\) learnability, with
$$k = 2^{O(VC)}$$
Compression Conjecture:
\(k = \Theta(VC)\)
Sample compression can be very helpful by:
being simpler and more intuitive
being more generic: it can work even when uniform convergence fails! It yields the optimal SVM bound, and we can also perform compression for learning under adversarial perturbations.
Typical classifiers are often:
Sensitive to "adversarial" perturbations, even when the noise is "imperceptible"
Vulnerable to malicious attacks
Ignore the "invariance" or domainknowledge
In the case of classification with adversarial perturbations we had \(l^{0/1}(h,x,y) = \mathbb{1}\{h(x) \neq y\}\) and \(l^U(h,x,y) = \sup_{\bar{x} \in U(x)} l^{0/1}(h,\bar{x},y)\),
and then \(L^U(D_Z,h) = \mathbb{E}_{(x,y) \sim D_Z}\, l^U(h,x,y)\).
Now, \(A_{Z,H}\) probably approximately correct learns \(H\) with \(m(\epsilon)\) samples if for all \(D_Z\) and for all \(\epsilon \in ( 0,1 )\). Now if \(S \sim D_Z^{m(\epsilon)}\) then:
$$\underset{S}{\mathrm{Pr}} [L^U(D_Z, A_{Z,H}(S)) > \epsilon + OPT] < 0.01$$
However, one question is whether robust ERM works for all \(H\):
$$L^U(S,h) = \frac{1}{|S|} \sum_{(x,y)\in S} l^U(h,x,y)$$
$$\hat{h} = \arg\min_{h \in H} L^U(S,h)$$
Robust ERM does not work for all \(H\): uniform convergence can fail, since
$$\sup_{h \in H} |L^U(S,h) - L^U(D_Z, h)|$$
can be unbounded.
In fact, any "proper learner" (one that outputs from \(H\)) can fail.
In a compression-based method, the decoder should recover the labels of the training set and their neighbours, and we then compress this inflated set:
$$k = 2^{O(VC)}$$
So,
$$m^U(\epsilon) = O \left( \frac{2^{VC(H)}}{\epsilon^2} \right)$$
There is an exponential dependence on \(VC(H)\).
Tolerant adversarial learning was introduced in [^ashtiani23a]: \(A_{Z,H}\) PAC learns \(H\) with \(m(\epsilon)\) samples
if \(\forall D_Z\), \(\forall \epsilon \in (0,1)\): if \(S \sim D_Z^{m(\epsilon)}\) then
$$\Pr_S[L^U(D_Z,A_{Z,H}(S)) > \epsilon + \inf_{h \in H} L^V(D_Z,h)] < 0.01$$
And,
$$m^{U,V}(\epsilon) = \tilde{O} \left( \frac{VC(H)d\log(1+\frac{1}{\gamma})}{\epsilon ^2} \right)$$
The trick is to avoid compressing an infinite set: the new goal is that the decoder only needs to recover labels of points in \(U(x)\).
To do so, we define a noisy empirical distribution (using \(V(x)\)) and use boosting to achieve very small error with respect to this distribution. We then encode the classifier using the samples used to train the weak learners, and the decoder smooths out the hypotheses.
It is interesting to ask why we need tolerance. There exist other ways to relax the problem and avoid \(2^{O(VC)}\):
bounded adversary
Limited blackbox query access to the hypothesis
Related to the certification problem
This is also observable in the density estimation example.
Gaussian mixture Models are very popular in practice and are one of the most basic universal density approximators. These are also the building blocks for more sophisticated density classes and can think of them as multimodal versions of Gaussians.
$$f(x) = w_1 N(x \mid \mu_1,\Sigma_1) + w_2 N(x \mid \mu_2,\Sigma_2) + w_3 N(x \mid \mu_3,\Sigma_3)$$
We say \(F\) is the class of Gaussian mixture models with \(k\) components in \(\mathbb{R}^d\), and we ask how many samples are needed to recover \(f \in F\) within \(L_1\) error \(\epsilon\).
The number of samples \(\simeq m(d,k,\epsilon)\).
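For concreteness, a 1-D mixture density of the form above can be evaluated directly (a toy sketch; the helper names and the two-component example are mine):

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def gmm_pdf(x, components):
    """1-D Gaussian mixture density; components is a list of (weight, mu, sigma)."""
    return sum(w * gaussian_pdf(x, mu, sigma) for w, mu, sigma in components)

mix = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]
# Crude Riemann-sum check that the mixture density integrates to (about) 1.
total = sum(gmm_pdf(-10 + 0.01 * i, mix) * 0.01 for i in range(2001))
print(round(total, 3))  # 1.0
```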
To learn a single Gaussian in \(\mathbb{R}^d\),
$$O \left( \frac{d^2}{\epsilon^2} \right) = O \left( \frac{\#\text{params}}{\epsilon^2} \right)$$
samples are sufficient (and necessary).
Now if we have a mixture of \(k\) Gaussians in \(\mathbb{R}^d\), we want to know whether
$$O \left( \frac{kd^2}{\epsilon^2} \right) = O \left( \frac{ \text{number of params}}{\epsilon^2} \right)$$
samples are sufficient?
There have been some results on learning Gaussian Mixture Models.
Let us take the example of this graph. For a moment, look at this as a binary classification problem: the decision boundary has a simple quadratic form!
$$VC\text{-}\dim = O(d^2)$$
Here "Sample compression" does not make sense as there are no "labels".
We have a class of distributions \(F\) (e.g., Gaussians) and two agents: an encoder A and a decoder B. If A sends \(t\) of its \(m\) sample points to B, and B uses them to approximate \(D\), we say \(F\) admits \((t,m)\)-compression.
Theorem:
If \(F\) has a compression scheme of size \((t,m)\) then sample complexity of learning \(F\) is
$$\tilde{O} \left( \frac{t}{\epsilon^2} + m \right)$$
\(\tilde{O}(\cdot)\) hides polylog factors.
Small compression schemes imply sampleefficient algorithms.
Theorem:
If \(F\) has a compression scheme of size \((t, m)\), then the class of \(k\)-mixtures of \(F\) admits \((kt,km)\)-compression.
Distribution compression schemes extend to mixture classes automatically! So for the case of GMMs in \(\mathbb{R}^d\) it is enough to come up with a good compression scheme for a single Gaussian!
For learning mixtures of Gaussians: encoding the center and axes of the ellipsoid is sufficient to recover \(N(\mu,\Sigma)\). This admits \(\tilde{O}(d^2,\frac{1}{\epsilon})\) compression! The technical challenge is encoding the \(d\) eigenvectors "accurately" using only \(d^2\) points.
\(\frac{\sigma_{max}}{\sigma_{min}}\) can be large, which is a technical challenge.
Compression is simple, intuitive, generic
Compression relies heavily on a few points
But still can give "robust" methods
Agnostic sample compression
Robust target compression
Target compression is quite general
Reduces the problem to learning from finite classes
Does it characterize learning?
So, I thought I would share my reviews of the papers I read as a blog series. These might not be perfect but I have put in quite some work in writing the reviews.
Here, I review the paper Domain Generalization using Causal Matching [^review]. I believe for anyone to benefit or follow along the blog, it might be best to first read the paper themselves and then read the review.
In this paper, the authors examine the task of domain generalization, in which the goal is to train a machine learning model that can perform well on new, unseen data distributions. The authors propose a new algorithm \(MatchDG\) which is a contrastive learning method that tries to approximately identify the object from which the input had been derived. The paper refers to this as a match. The paper relies on trying to have the same representation for inputs across domains if they are derived from the same object.
They first introduce the perfect match approach, which involves learning representations that satisfy the invariance criteria and are informative of a label across domains. The perfect match case is applicable when we have self-augmentations, in which case we get a perfect base object. As expected, the authors report strong performance on Rotated MNIST with perfect matching.
However, in most cases, we do not know the perfect matches across domains in which case the authors propose finding approximate matches without known objects. To learn this matching function, \(MatchDG\) optimizes a contrastive representation learning loss. Positive matches here mean data points from a different domain that share the same class label as the anchor. The key difference in the training of this algorithm is updating positive matches: after every \(t\) epochs, the positive matches inferred using the learned matching function are updated during training based on the nearest sameclass data points in the representation space. The intuition for this could be thought of as accounting for the intraclass variance across domains. After this, the learned matching function is substituted in the perfect match loss function.
For a finite number of domains \(m\), as the number of examples in each domain \(n_d\rightarrow \infty\):
The set of representations that satisfy the condition \(\sum_{\Omega(j,k)=1; d\neq d'} dist(\phi(x_j^{(d)}), \phi(x_k^{(d')})) =0\) contains the optimal \(\phi(x)=X_C\) that minimizes the domain generalization loss.
Assuming that \(P(X_a \mid O, D)<1\) for every high-level feature \(X_a\) that is directly caused by domain, and for P-admissible loss functions whose minimization is the conditional expectation (e.g., \(\ell_2\) or cross-entropy), a loss-minimizing classifier for the following loss is the true function \(f^*\), for some value of \(\lambda\), provided that \(f^* \in F_c\subseteq F\).
$$f_{\text{perfect-match}} = \arg\min_{h, \phi} \sum_{d=1}^{m} L_d(h(\phi(X)), Y) + \lambda \sum_{\Omega(j,k)=1;\, d\neq d'} \text{dist}(\phi(x_j^{(d)}), \phi(x_k^{(d')}))$$
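To illustrate the second term of this objective, here is a toy computation of the match penalty, assuming squared Euclidean distance for dist (the paper's actual distance and representations may differ; all names below are mine):

```python
def match_penalty(phi, matches):
    """Sum of squared-Euclidean distances between the representations of matched
    pairs across domains: `matches` holds pairs (x_j from domain d, x_k from
    domain d') with Omega(j, k) = 1."""
    def sqdist(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    return sum(sqdist(phi(xj), phi(xk)) for xj, xk in matches)

# A representation that is constant across matched pairs drives the penalty to 0.
phi = lambda x: [x[0] + x[1]]            # toy representation
pairs = [((1.0, 2.0), (2.0, 1.0)),       # same coordinate sum -> same representation
         ((0.0, 3.0), (3.0, 0.0))]
print(match_penalty(phi, pairs))  # 0.0
```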
However, this might not always be true. You can theoretically generate data such that you can learn the specific transformations and minimize the objective to be 0. For example, this statement will most certainly not hold true if we are given the liberty to choose a subset of the data from RotatedMNIST. However, in most cases minimizing for \(C(\phi)\) does imply better generalizability.
The intuition for using a twophase approach for \(MatchDG\) is well motivated and contrasted with previous approaches as well as backed by experiments. It might be very helpful to signify the use of a twophase approach with some more ablation studies.
Might be helpful to investigate more on the poor performance of Chest XRays dataset and experiments on some other large datasets.
I found this work by Jiles et al. [^jiles2022re] that showed some discrepancies in evaluation.
4 excellent
4 excellent
7: Accept: Technically solid paper, with high impact on at least one subarea of AI or moderate-to-high impact on more than one area of AI, with good-to-excellent evaluation, resources, reproducibility, and no unaddressed ethical considerations.
4: You are confident in your assessment, but not absolutely certain. It is unlikely, but not impossible, that you did not understand some parts of the submission or that you are unfamiliar with some pieces of related work.
[^review]: Mahajan, Divyat, Shruti Tople, and Amit Sharma. "Domain generalization using causal matching." International Conference on Machine Learning. PMLR, 2021.
[^jiles2022re]: D Jiles, Richard, and Mohna Chakraborty. "[Re] Domain Generalization using Causal Matching." ML Reproducibility Challenge 2021 (Fall Edition). 2022.
When you're starting a new project, you might have to add multiple linting tools to beautify your code and prevent simple errors.
You will often use multiple linters: one might be installed via npm, another via PyPI, and so on. You will also want to set up some automation in your CI to run these linters, but this process is quite tedious 😫.
In this article, I will show you how you can use GitHub Super Linter, a single linter to solve all these problems. Most of my personal projects use GitHub Super Linter as well, and I have personally found it to be a huge lifesaver.
Linting is essentially a form of static code analysis. It analyzes the code you wrote against some rules for stylistic or programmatic errors. Think of it as a tool that flags suspicious usage in software.
A linter can help you save a lot of time by:
- catching bugs and suspicious constructs before they ever run
- enforcing a consistent style across all contributors
- reducing the amount of trivial feedback needed in code reviews
Given the useful nature of linting tools, you would ideally want to run a linter before any code reviews happen on every single piece of code that is pushed to your repository. This definitely helps you write better, more readable, and more stable code.
Here is an example of using Black, a linting tool for Python focusing on code formatting.
Formatting changes made by Black
GitHub Super Linter can help you quite a lot in bringing these capabilities to your projects easily and efficiently. GitHub Super Linter is a combination of multiple commonly used linters which you can use very easily. It lets you set up automated runs for these linters, as well as manage multiple linters in a single project!
There is also a ton of customization capabilities with environment variables that can help you customize the Super Linter to your individual repository.
Super Linter is primarily designed to be run inside a GitHub Action, which is also how I have been using it for quite some time. We will talk about this first. To follow along, you should create a new GitHub Action in your repository. Let us create a new file at `.github/workflows/linter.yml`.
Going forward, I will assume you know about the basic syntax for GitHub Actions. But in case you do not or need a quick refresher, I would suggest going through this Quick Start Guide.
We already have a blank file `.github/workflows/linter.yml`, which we will now populate with an action you can use to lint your project.
We will start by giving our action a name. This is what appears under the GitHub Action Status Check:
```yaml
name: Lint Code Base
```
Next up, let's specify triggers for our action. This answers the question of when you should lint your codebase. Here we tell it to run the lint on every push and on every pull request.
```yaml
name: Lint Code Base

on: [push, pull_request]
```
This is another very commonly used configuration for the triggers. It only runs when you make a pull request to the `main` or `master` branches, and not on pushes to those branches.
```yaml
on:
  push:
    branches-ignore: [master, main]
  pull_request:
    branches: [master, main]
```
Next up, we want to set up a job. All the components you put in a single job will run sequentially. Here, think of it as the steps and in which order we want them to run whenever the trigger is satisfied.
We will name this job "Lint Code Base" and ask GitHub to run it on a runner with the latest version of Ubuntu supported by GitHub.
```yaml
name: Lint Code Base

on: [push, pull_request]

jobs:
  build:
    name: Lint Code Base
    runs-on: ubuntu-latest
```
You are not bound to using a single kind of runner (`ubuntu-latest`) like we do here. It's a common practice to use a matrix of runners to test that your code works well on all kinds of platforms. GitHub Super Linter doesn't work any differently on other machine types, though, so we just use a single one.
Next up, we will start defining the steps we want this workflow to have. We essentially have two steps:
On to checking out the code. To do this, we will use the official checkout action by GitHub.
We will set `fetch-depth: 0` to fetch all history for all branches and tags, which Super Linter requires to get a proper list of changed files. Without this, only a single commit would be fetched.
We also give our step a name and tell it we want to use the action present in the GitHub repository at `actions/checkout@v3`.
```yaml
name: Lint Code Base

on: [push, pull_request]

jobs:
  build:
    name: Lint Code Base
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
        with:
          fetch-depth: 0
```
This piece of code checks out your repository under `$GITHUB_WORKSPACE`, which allows the rest of the workflow to access it. The repository we are checking out is the one where your code resides, ideally the same repository.
Now we'll add the step to run the linter since we have our code checked out. You can customize GitHub Super Linter using environment variables when running the action.
```yaml
name: Lint Code Base

on: [push, pull_request]

jobs:
  build:
    name: Lint Code Base
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Lint Code Base
        uses: github/super-linter@v4
```
We will now look at the environment variables you will most often use with GitHub Super Linter, along with some examples.
- `VALIDATE_ALL_CODEBASE`: decides whether Super Linter should lint the whole codebase or just the changes introduced by that commit. These changes are found using `git diff`, but you can also change the search algorithm (we will not be looking into this in this article). Example: `VALIDATE_ALL_CODEBASE: true`.
- `GITHUB_TOKEN`: as the name suggests, this is the value of the GitHub token. If you use this, GitHub will show each of the linters you use (we will see how to do that soon) as a separate check in the UI. Example: in GitHub Actions you can use `GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}`.
- `DEFAULT_BRANCH`: the name of the repository's default branch. Example: `DEFAULT_BRANCH: main`.
- `IGNORE_GENERATED_FILES`: in case you have any files which are generated by tools, you can mark them as `@generated`. If this environment variable is set to `true`, Super Linter ignores these files. Example: `IGNORE_GENERATED_FILES: true`.
- `IGNORE_GITIGNORED_FILES`: excludes the files which are in `.gitignore` from linting. Example: `IGNORE_GITIGNORED_FILES: true`.
- `LINTER_RULES_PATH`: a custom path where any linter customization files should be. By default your files are expected to be at `.github/linters/`. Example: `LINTER_RULES_PATH: /`.
These are some of the environment variables you will use the most often, but none of the ones we've discussed yet cover language-specific linting.
If you do not use any of the environment variables we talk about, Super Linter automatically finds and uses all the applicable linters for your codebase.
You will often only be interested in using specific linters for your projects. You can use the following environment variable pattern to add any linters you want:
`VALIDATE_{LANGUAGE}_{LINTER}`
You can find the naming conventions for these in the list of Supported Linters.
Here are a couple of examples, where we specify we want to use Black to lint all Python files, ESLint for JavaScript files, and HTMLHint for HTML files.
```yaml
name: Lint Code Base

on: [push, pull_request]

jobs:
  build:
    name: Lint Code Base
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Lint Code Base
        uses: github/super-linter@v4
        env:
          VALIDATE_ALL_CODEBASE: true
          VALIDATE_JAVASCRIPT_ES: true
          VALIDATE_PYTHON_BLACK: true
          VALIDATE_HTML: true
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
Once you set one of the linters to `true`, all the other linters will not run. In the above snippet, none of the linters except ESLint, Black, and HTMLHint will run.
However, in this next example, we set a single linter to `false`, so every linter except ESLint will run:
```yaml
name: Lint Code Base

on: [push, pull_request]

jobs:
  build:
    name: Lint Code Base
    runs-on: ubuntu-latest
    steps:
      - name: Checkout Code
        uses: actions/checkout@v3
        with:
          fetch-depth: 0

      - name: Lint Code Base
        uses: github/super-linter@v4
        env:
          VALIDATE_ALL_CODEBASE: true
          VALIDATE_JAVASCRIPT_ES: false
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
Linters often use configuration files so you can modify the rules the linter applies. In the two full-length examples above, Super Linter will try to find any configuration files under `.github/linters/`.
These could be your `.eslintrc.yml` file used to configure ESLint, `.htmlhintrc` for configuring HTMLHint, and so on.
Here is an example of the configuration file if you use the Flake8 linter for Python:
[flake8]
max-line-length = 120
You save this at .github/linters/.flake8. You'll then use it while running the Flake8 linter. You can find an example of template configuration files you can use here.
If you want to change this default path, here are two ways to do it:
Add the directory path as an environment variable like this:
LINTER_RULES_PATH: configs/
You can also hardcode a path for a specific linter as an environment variable. Here is an example:
JAVASCRIPT_ES_CONFIG_FILE: configs/linters/.eslintrc.yml
GitHub Super Linter was built to run inside GitHub Actions, but running it locally or on other CI platforms can be particularly helpful. On other CI platforms, you essentially run Super Linter the same way you would locally.
You first want to pull the latest Docker container down from DockerHub with this command:
docker pull github/super-linter:latest
To run this container you then run the following:
docker run -e RUN_LOCAL=true -e USE_FIND_ALGORITHM=true -e VALIDATE_PYTHON_BLACK=true -v /project/directory:/tmp/lint github/super-linter
Notice a couple of things here:
The RUN_LOCAL flag bypasses some of the GitHub Actions checks. It automatically sets VALIDATE_ALL_CODEBASE to true.
You mount your code at /tmp/lint so that the linter can pick it up.
Running GitHub Super Linter on other CI platforms is pretty similar to running it locally. Here is an example of running it in Azure Pipelines by Tao Yang:
- job: lint_tests
  displayName: Lint Tests
  pool:
    vmImage: ubuntu-latest
  steps:
    - script: |
        docker pull github/super-linter:latest
        docker run -e RUN_LOCAL=true -v $(System.DefaultWorkingDirectory):/tmp/lint github/super-linter
      displayName: 'Code Scan using GitHub Super-Linter'
This just runs the commands we would have for locally running the Super Linter as a script. You could run it in the exact same way on other CI platforms.
Thank you for sticking with me until the end. I hope that you've taken away a thing or two about linting and using GitHub Super Linter. It has most certainly been one of my favorite open source projects.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!
You can also find me on Twitter @rishit_dagli, where I tweet about open source and machine learning.
This year I got a chance to attend my first in-person KubeCon + CloudNativeCon in Valencia, Spain under the generous Dan Kohn scholarship by CNCF and the Linux Foundation. Thinking back to when I was applying for the scholarship, I had a ton of things to worry about: I had never seen a high schooler like me receive the scholarship, I wondered whether, being younger than others, I would be welcomed at interactive events (PS: that worry turned out to be unfounded), and this was also my first solo international trip, and much more. This might sound scary at first, but the effort of going through these challenges was absolutely worth it, and in this article I give you some glimpses of my experiences at KubeCon + CloudNativeCon Europe, paired with some of my learnings from the conference which might help you too!
This year's Dan Kohn Scholars (can you spot me?)
In the past, I have attended a KubeCon + CloudNativeCon event virtually, but the in-person event experience was completely different and a ton of fun. CNCF is one of the best communities I have seen, and the best part of such conferences is getting to meet this amazing CNCF community in person: I got to meet people I had drawn inspiration from and people I had worked with for quite some time, and made a ton of new friends as well.
KubeCon + CloudNativeCon is also one of the largest conferences in the world, where developers and companies excited about or adopting Cloud Native gather and discuss new ideas. KubeCon is a five-day conference comprising co-located events, technical and non-technical talks, networking sessions, interactive activities, parties, and more. Be it virtual or in-person, you should expect five days packed with learning, networking, and a lot of fun. I will partition my experiences among the five days in this blog post.
The first two days of KubeCon usually consist of different co-located events. Co-located events are add-on events to the main KubeCon activities that focus on a specific topic. I was particularly interested in attending Kubernetes AI Day considering my interests. The co-located event started off with some talks, often about specific technical concepts in Machine Learning on Kubernetes, and a talk about solving socially impactful medical problems with Machine Learning too, which were quite interesting to watch. Almost all the talks at Kubernetes AI Day were paired with a demo or a hands-on component. Something I do to make the most of what's presented at these events is to make a list of the talks or demos that particularly excited me and then watch their recordings later. That way I have the liberty to better understand and try out the demos for myself, and since I have already watched and processed the talk once, I often end up learning on top of the presented content. I am still working through my list, but for the talks I have finished, this approach has worked quite well for me.
Talk at the Kubernetes AI Day
Co-located events are not just about attending talks: you have ample time to network with speakers and other people attending the event. I would recommend anyone to make the best use of opportunities like these. Going into my first networking session, all my fear disappeared, and everyone around was helpful and considerate. Not only did I not feel left out, but it was a pleasure when people came over and told me they had used one of my libraries or seen me in the open-source community. All I would tell anyone attending the networking sessions is to remember that the community around them is helpful; all you need to do is ask! Following a lengthy discussion with a speaker from Kubernetes AI Day, I even developed a project and talk idea which is now accepted at Kubernetes Community Days Berlin.
I also attended a couple of Kubernetes Contributor Summit talks which had a completely different vibe, an informal conference with a ton of discussion.
Networking at the Kubernetes AI Day
With the ending of Kubernetes AI Day it was almost evening, though you could not tell due to the pleasant atmosphere in Valencia: there was proper daylight until 9:00 PM each day. Usually, in the latter half of the day, there are multiple parties one can attend, which adds an element of fun as well as a place to have longer discussions and casually meet others from the community. I arrived just in time to see Bart Farrell, who leads the Data on Kubernetes community, sing. My plan for this day, though, was to attend the famous Kubernetes Contributor Summit Party, which is a great way for contributors to meet and greet while having some fun. My major highlight from the Kubernetes Contributor Summit was getting to chat with people I drew inspiration from: Saiyam Pathak, Divya Mohan, David Flanagan, Tim Bannister, and a lot more!
The Kubernetes Contributor Summit
Owing to all the activities at the Kubernetes Contributor Summit (our team won the Kubernetes Trivia too :) ), it was quite late, but the weather in Valencia was so nice that we had a three-person party on the beach near the venue with some new friends I made at the Contributor Summit. I not only heard some fantastic stories and workplace mess-ups :laughing: but also got some really insightful advice on choosing colleges, which I would be doing in a couple of months. It was just the first day of KubeCon and I had already made many new friends and gotten a ton of perspective and insightful advice!
Making some new friends.
My Day 2 started off with a grand unofficial breakfast with most of the scholarship recipients and speakers from my country, India, since most of us were staying in the same hotel anyway (thanks CNCF for the room block). Being interested in Machine Learning research myself, day 2 of the conference started on a high note for me, as I got to meet and sit down for lunch with Han Xiao, the author of Fashion-MNIST, a well-cited and well-known paper in the Machine Learning space. I also got to meet multiple times and have discussions with Kunal Kushwaha and Shivay Lamba, whose work I had been following for quite a long time. There could not have been a better start.
Day 2 is also a co-located event day, and my plan was to attend the OpenShift Commons Gathering co-located event. I was also excited to attend a couple of project meetings. Project meetings at KubeCon are longer, workshop-format talks, typically in a smaller room, where you not only get to ask a ton of questions, more like a discussion of some sort, but also interact with the maintainers of projects. These events are not livestreamed or recorded.
Something you should definitely try at conferences is going outside your comfort zone. There is probably a technology you have not worked with or an area of development you have never explored, and KubeCon is the place to attend a session and see if it interests you, or just come back with a decent knowledge of the technology. In fact, I had only explored a bit of OpenShift earlier and came to the OpenShift Commons Gathering with just a little research. Yes, I did not understand all the content in this co-located event, but it gave me a way to start exploring OpenShift in depth now.
OpenShift Commons Gathering
I also had a great time attending the Litmus Chaos and containerd project meetings. Going in with very little knowledge of Litmus Chaos, the atmosphere of these meetings and the people around made me feel welcome. To make the most out of a tutorial or workshop-style talk, I would suggest you try and do two things beforehand, which worked quite well for me:
Finally, I also attended a discussion event, which was super fun, and KubeCon had a lot of these; this one was by the TODO Group on open source. These kinds of events not only allow you to get insights from other people but also to share your experiences and get some feedback and visibility for your work. I am also going to another Linux Foundation event, "OpenJS World", from 7 June 2022, and I got the chance to meet the executive director of OpenJS in this discussion! I would suggest anyone attending in-person conferences not be afraid to just go up and say hi to others.
The JFrog team
That day, when I was off sightseeing at a museum, I ended up meeting a fellow KubeCon attendee while asking him to take a photo of me. KubeCon brought more than 7,000 people to Valencia, so you could literally meet developers everywhere outside of KubeCon as well. We ended the day attending the JFrog party, where I got to meet a ton of awesome folks from JFrog and others, and I'm also working on a new open-source project with friends I made at this party. Parties at KubeCon are just great opportunities, so make sure to use them well. The one bad decision I made on day 2 was thinking I could walk 9 kilometers back to my hotel.
Day 3 was when the main conference started. Most of the attendees arrive on this day, so what was a less crowded place for the first two days became super crowded as everyone tried to get in for the keynote. The line to get inside spanned a block outside the venue, but just think about all the developers KubeCon brought together.
Trying to enter the conference on day 3
Once I got into the venue after the long lines, I attended the keynote talks, which probably everyone at KubeCon attends. I was looking forward to the keynotes, which were kicked off by Priyanka Sharma, who also invited some other folks from the community. Here is a very brief overview of the keynotes:
Day 3 is also when one of the most famous aspects of KubeCon + CloudNativeCon starts: the sponsor showcase. It is a great way to discuss with other developers, discover new technologies, brainstorm project ideas, and get some sweet swag (thanks IBM for a drone), and it is often the place where you'll end up with job (in my case, internship) offers too. You can also have some fun in this area, with stalls like Microsoft letting you play Forza Horizon on a real steering wheel, GitLab with their VR racing stall, and multiple other games. Unlike what you might assume from the name, most of the stalls in the sponsor showcase are actually manned by developers who are super excited to discuss the projects you are making, show you some live demos, and give you advice, making the sponsor showcase even better.
Here's a video I took just to help you get an idea of the popularity of the sponsor showcase among KubeCon attendees, taken just before the sponsor showcase opened.
I finally ended another hectic day by attending the Microsoft and AWS parties, where I also got to meet and chat with Bridget Kromhout and Stephen Augustus, among others from whom I greatly drew inspiration. Undoubtedly, I had a ton of fanboy moments at KubeCon. PS: If you want to see some super funny moments from the AWS party, check out the tweets from this account.
I was quite excited about the talks on Day 4 since they featured the two talks I was personally most excited to attend, by the Machine Learning team at CERN. Apart from ML research, I am also interested in particle physics, making CERN one of the experiences I have dreamt of. I was not only able to learn from these presentations but also used the time after the talks to have discussions with the speakers.
With the ML team at CERN
A couple of things that have come in super handy to me while networking at KubeCon are:
A highlight of the keynote talks I attended:
Day 4 is also when the grand all-attendee party happens, which was at a palace around an hour away by bus. The bus ride to most parties is often a ton of fun too: since everyone on the bus is a KubeCon attendee, you could literally start a tech conversation with anyone or talk about sessions at KubeCon. The all-attendee party was also when I got to appear in the KubeCon highlight video being shot by The Linux Foundation, and I got to chat with Priyanka, whose keynote I had watched the other day and loved! Finally, to end the day, I attended the Giant Swarm party, which was at probably one of the most beautiful venues.
This was the last day of KubeCon and my last day in Valencia, Spain as well, and it felt like the conference got over way too soon; I did not realize how fast the last four days went. I started my day 5 attending some keynote sessions I found exciting, which I now give you a quick overview of:
Following this, I attended the "Peer Group Mentoring + Career Networking" session, which included multiple roundtables, each on one of three topics related to growing your career. You would switch roundtables three times to cover each topic, and each roundtable had a mentor to help you. This was indeed super helpful, especially with awesome mentors like Davanum Srinivas and more. One of the quite fun talks I attended was called "What Anime Taught Me About K8s Development & Tech Careers". Having loved watching anime myself, I loved attending this talk, which was essentially about how one could get started on their Cloud Native journey. I had watched a similar talk back when I started my own journey with Cloud Native, but well, I was there for the anime.
Another roundtable session I attended this day was the "SIG Meet and Greet" where, as the name suggests, you could meet SIG leaders and talk about development, the roadmap, how to get involved in the SIG, and a lot more. Such roundtable sessions are quite helpful, and I am very glad I got to attend some of them. I learned about a couple of Working Groups and Special Interest Groups, like WG Batch and SIG ContribEx, which I will start contributing to as well.
I ended the day attending a couple more sessions but left the venue a bit early, as some of us, Dan Kohn scholars and some speakers from India, had a casual party planned, where we got our hands on Indian food for the first time in (for me) nine days and then just roamed the streets of Valencia before boarding a flight back to my home city, Mumbai, the next morning.
Finally, the conference ended, and I am already looking forward to the next KubeCon + CloudNativeCon and learning more from the fantastic CNCF community. I came back with a ton of great experiences and look forward to more! Post-conference, seeing a couple of tweets like this one made me feel pretty nice about the people I got to connect with and continue discussions with after the conference too.
PS: Not tomorrow; now is the right time for you to be a part of the open-source community and start contributing, if you aren't already.
Feel free to reach out to me on Twitter @rishit_dagli to talk about KubeCon, Kubernetes, CNCF or well anything.
Graph Neural Networks are getting more and more popular and are being used extensively in a wide variety of projects.
In this article, I help you get started and understand how graph neural networks work while also trying to address the question "why" at each stage.
Finally, we will also take a look at implementing some of the methods we talk about in this article in code.
And don't worry you won't need to know very much math to understand these concepts and learn how to apply them.
Put quite simply, a graph is a collection of nodes and the edges between the nodes. In the below diagram, the white circles represent the nodes, and they are connected with edges, the red colored lines.
You could continue adding nodes and edges to the graph. You could also add directions to the edges which would make it a directed graph.
Something quite handy is the adjacency matrix which is a way to express the graph. The values of this matrix \(A_{ij}\) are defined as:
\[A_{ij} = \left\{\begin{array}{ c l }1 & \quad \textrm{if there exists an edge } j \rightarrow i \\ 0 & \quad \textrm{if no edge exists} \end{array} \right. \]
Another way to represent the adjacency matrix is simply flipping the direction so in the same equation \(A_{ij}\) will be 1 if there is an edge \(i \rightarrow j\) instead.
The latter representation is in fact what I studied in school. But you will often find the first notation used in Machine Learning papers, so for this article we will stick to the first representation.
There are a lot of interesting things you might notice from the adjacency matrix. First of all, if the graph is undirected, you end up with a symmetric matrix, which has more interesting properties, especially in the eigenvalues of this matrix.
One such interpretation, which will be helpful in this context, is that taking powers of the matrix, \((A^n)_{ij}\), gives us the number of (directed or undirected) walks of length \(n\) between nodes \(i\) and \(j\).
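To make the walk-counting interpretation concrete, here is a small NumPy check on a directed 3-cycle (a toy graph chosen purely for illustration):

```python
import numpy as np

# Directed 3-cycle 1 -> 2 -> 3 -> 1, written with the convention above:
# A[i][j] = 1 if there is an edge j -> i (0-indexed here).
A = np.array([
    [0, 0, 1],   # edge 3 -> 1
    [1, 0, 0],   # edge 1 -> 2
    [0, 1, 0],   # edge 2 -> 3
])

# (A^n)[i][j] counts the directed walks of length n from node j to node i.
walks_2 = np.linalg.matrix_power(A, 2)
print(walks_2)

# On a 3-cycle, every node has exactly one walk of length 3 back to itself,
# so A^3 is the identity matrix.
print(np.linalg.matrix_power(A, 3))
```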
Well graphs are used in all kinds of common scenarios, and they have many possible applications.
Probably the most common application of representing data with graphs is using molecular graphs to represent chemical structures. These have helped predict bond lengths, charges, and new molecules.
With molecular graphs, you can use Machine Learning to predict if a molecule is a potent drug.
For example, you could train a graph neural network to predict if a molecule will inhibit certain bacteria and train it on a variety of compounds you know the results for.
Then you could essentially apply your model to any molecule and end up discovering that a previously overlooked molecule would in fact work as an excellent antibiotic. This is how Stokes et al. in their paper (2020) predicted a new antibiotic called Halicin.
Another interesting paper by DeepMind (ETA Prediction with Graph Neural Networks in Google Maps, 2021) modeled transportation maps as graphs and ran a graph neural network to improve the accuracy of ETAs by up to 50% in Google Maps.
In this paper they partition travel routes into super segments which model a part of the route. This gave them a graph structure to operate over on which they run a graph neural network.
There have been other interesting papers that represent naturally occurring data as graphs (social networks, electrical circuits, Feynman diagrams and more) that made significant discoveries as well.
And if you think about it, a standard neural network can be represented as a graph too 🤯.
Let's first start with what we might want to do with our graph neural network before understanding how we would do that.
One kind of output we might want from our graph neural network is on the entire graph level, to have a single output vector. You could relate this kind of output with the ETA prediction or predicting binding energy from a molecular structure from the examples we talked about.
Another kind of output you might want is the node or edge level predictions and end up with a vector for each node or edge. You could relate this with an example where you need to rank every node in the prediction or probably predict the bond angle for all bonds given the molecular structure.
You might also be interested in answering the question "where should I place a new edge or node?", that is, predicting where an edge or a node might appear. Not only could we get that prediction from a graph, we could also first turn some other data into a graph to pose the question.
As you might have guessed with the graph neural network, we first want to generate an output graph or latents from which we would then be able to work on this wide variety of standard tasks.
So essentially what we need to do from the latent graph (features for each node represented as \(\vec{h_i}\)) for the graph level predictions is:
\[\vec{Z_G} = f(\sum_i \vec{h_i})\]
And now it is quite simple to show on a high level what we need to do from the latents to get our outputs.
For node level outputs we would just have one node vector passed into our function and get the predictions for that node:
\[\vec{Z_i} = f(\vec{h_i})\]
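As a minimal sketch of both kinds of outputs, with made-up dimensions and a simple linear map standing in for the learned function \(f\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Latent node features h_i for a graph with 4 nodes, each of dimension 8.
H = rng.normal(size=(4, 8))

# A stand-in for the learned function f: here just a fixed linear map.
W = rng.normal(size=(8, 2))
f = lambda x: x @ W

# Graph-level output: apply f to the (permutation-invariant) sum of node latents.
z_graph = f(H.sum(axis=0))          # shape (2,)

# Node-level outputs: apply f to each node latent independently.
z_nodes = f(H)                      # shape (4, 2)

print(z_graph.shape, z_nodes.shape)
```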
Now that we know what we can do with the graph neural networks and why you might want to represent your data in graphs, let's see how we would go about training on graph data.
But first off, we have a problem on our hands: graphs are essentially variable size inputs. In a standard neural network, as shown in the figure below, the input layer (shown in the figure as \(x_i\)) has a fixed number of neurons. In this network you cannot suddenly apply the network to a variable sized input.
But if you recall, you can apply convolutional neural networks on variable sized inputs.
Let's put this in terms of an example: you have a convolution with filter count \(K=5\), spatial extent \(F=2\), stride \(S=4\), and no zero padding (\(P=0\)). You can pass in \((256 \times 256 \times 3)\) inputs and get \((64 \times 64 \times 5)\) outputs (since \(\left\lfloor \frac{256 - 2 + 0}{4} + 1 \right\rfloor = 64\)), and you can also pass in \((96 \times 96 \times 3)\) inputs and get \((24 \times 24 \times 5)\) outputs, and so on: the convolution is essentially independent of the input's spatial size.
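The output-size arithmetic above is easy to check with a tiny helper (the formula is the standard one for convolutions; the function name is mine):

```python
def conv_output_size(n, f, s, p):
    """One spatial dimension of a conv output: floor((n - f + 2p) / s) + 1."""
    return (n - f + 2 * p) // s + 1

# F=2, S=4, P=0 as in the text:
print(conv_output_size(256, 2, 4, 0))  # 64
print(conv_output_size(96, 2, 4, 0))   # 24
```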
This does make us wonder if we can draw some inspiration from convolutional neural networks.
Another really interesting way of solving the problem of variable input sizes, which takes inspiration from physics, comes from the paper Learning to Simulate Complex Physics with Graph Networks by DeepMind (2020).
Let's start off by taking some particles \(i\) and each of those particles have a certain location \(\vec{r_i}\) and some velocity \(\vec{v_i}\). Let's say that these particles have springs in between them to help us understand any interactions.
Now this system is, of course, a graph: you can take the particles to be nodes and the springs to be edges. If you recall simple high-school physics, \(force = mass \cdot acceleration\), and, well, what is another way in this system to express the total force acting on a particle? It is the sum of the forces exerted by all neighboring particles.
You can now write (\(e_{ij}\) represents the properties of the edge or spring between i and j):
\[m\frac{\mathrm{d} \vec{v_i}}{\mathrm{d}t} = \sum_{j \in \textrm{ neighbours of } i } \vec{F}(\vec{r_i}, \vec{r_j}, e_{ij})\]
Something I would like to draw your attention to here is that this force law is always the same. Maybe there are differences in the properties of the spring or edge, but you can still apply the same law. You can have different numbers of nodes and edges and you can still apply the exact same equation of motion.
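A hypothetical toy version of this in NumPy, with made-up positions and spring stiffnesses, shows how the same per-edge law yields per-node totals for any graph size:

```python
import numpy as np

# Hypothetical toy system: 3 particles connected by springs (edges).
positions = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
edges = [(0, 1), (0, 2)]                  # springs between particles
k = {(0, 1): 2.0, (0, 2): 1.0}            # per-edge property e_ij: stiffness

def spring_force(r_i, r_j, stiffness):
    # Hooke's law with rest length 0: the force on i pulls it toward j.
    return stiffness * (r_j - r_i)

# The same force law applied to every edge, summed per node -- the
# computation works unchanged for any number of particles and springs.
forces = np.zeros_like(positions)
for i, j in edges:
    f = spring_force(positions[i], positions[j], k[(i, j)])
    forces[i] += f
    forces[j] -= f   # Newton's third law: equal and opposite on j

print(forces)
```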
If you look closely, the intuitions we discussed to get around the problem of fixed inputs have an aspect of similarity to them: it is fairly clear in writing that the second approach takes into account the neighboring nodes and edges and creates some function (here force) of it. I wanted to point out that the way convolutional neural networks work is not much different.
Now that we've discussed what might give us inspiration to create a graph neural network, let's now try actually building one. Here we'll see how we can learn from the data residing in a graph.
We will start by talking about "Neural Message Passing" which is analogous to filters in a convolutional neural network or force which we talked about in the earlier section.
So let's say we have a graph with 3 nodes (directed or undirected). As you might have guessed, we have a corresponding value for each node \(x_1\), \(x_2\) and \(x_3\).
Just like any neural network, our goal is to find an algorithm to update these node values which is analogous to a layer in the graph neural network. And then you can of course keep on adding such layers.
So how do you do these updates? One idea would be to use the edges in our graph. For the purposes of this article, let's assume that from the 3 nodes we have an edge pointing from \(x_3 \rightarrow x_1\). We can send a message along this edge which will carry a value that will be computed by some neural network.
For this case we can write this down like below (and we will break down what this means too):
\[\vec{m_{31}}=f_e(\vec{h_3}, \vec{h_1}, \vec{e_{31}})\]
We will use our same notations:
And let's say we have an edge from \(x_2 \rightarrow x_1\) as well. We can apply the same expression we created above, just replacing the node numbers.
If you have more nodes, you would want to do this for every edge pointing to node 1. And the easiest way to accumulate all these is to simply sum them up. Look closely and you will see this is really similar to the intuition from particles we had discussed earlier!
Now you have an aggregated value of the messages coming to node 1, but you still need to update that node's value. For this we use another neural network \(f_v\), often called the update network. It depends on two things: the original value of node 1, of course, and the aggregate of the messages we had.
Simply putting these together, not just for node 1 in our example but for any node in any graph, we can write it down as:
\[ \vec{h_i^{\prime}} = f_v(\vec{h_i}, \sum_{j \in N_i} \vec{m_{ji}}) \]
\(\vec{h_i^{\prime}}\) are our updated node values, and \(\vec{m_{ji}}\) are the messages coming to node \(i\) that we calculated earlier.
You would then apply these same two neural networks \(f_e\) and \(f_v\) for each of the nodes comprising the graph.
A really important thing to note here is that the two neural networks that update our node values operate on fixed-size inputs, just like a standard neural network. Generally, the two neural networks we spoke of, \(f_e\) and \(f_v\), are small MLPs.
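Putting the message and update steps together, here is a minimal NumPy sketch of one message-passing layer; single tanh layers stand in for the small MLPs \(f_e\) and \(f_v\), and all shapes and names are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                                    # node/edge feature dimension

# Stand-ins for the two small MLPs: single tanh layers.
W_e = rng.normal(size=(3 * d, d))        # f_e sees [h_sender, h_receiver, e]
W_v = rng.normal(size=(2 * d, d))        # f_v sees [h_i, aggregated messages]

def f_e(h_src, h_dst, e):
    return np.tanh(np.concatenate([h_src, h_dst, e]) @ W_e)

def f_v(h, m_sum):
    return np.tanh(np.concatenate([h, m_sum]) @ W_v)

def mpnn_layer(H, edges, E):
    """One round of message passing; edges are (source, destination) pairs."""
    H_new = np.zeros_like(H)
    for i in range(H.shape[0]):
        # Sum the messages over all edges pointing at node i.
        m_sum = np.zeros(d)
        for (src, dst), e in zip(edges, E):
            if dst == i:
                m_sum += f_e(H[src], H[i], e)
        H_new[i] = f_v(H[i], m_sum)
    return H_new

# Tiny graph from the running example: edges 3 -> 1 and 2 -> 1 (0-indexed).
H = rng.normal(size=(3, d))
edges = [(2, 0), (1, 0)]
E = [rng.normal(size=d) for _ in edges]
print(mpnn_layer(H, edges, E).shape)     # one layer, same number of nodes
```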
Earlier we talked about the different kind of outputs we are interested in obtaining from our graph neural networks. You might have already noticed that when training our model the way we talked about, we will be able to generate the node level predictions: a vector for each node.
To perform graph classification, we want to aggregate all the node values we have after running our network. We will use a readout or pooling layer (it is quite clear where the name comes from).
Generally, we can create a function \(f_r\) that depends on the set of node values. It should be permutation invariant (it should not depend on your choice of labelling for the nodes), and it should look something like this:
\[y^{\prime} = f_r(\{x_i \mid i \in \textrm{graph}\})\]
The simplest way to define a readout function is to sum over all node values; you could also take the mean, maximum, or minimum, or a combination of these or other permutation-invariant properties, whichever best suits the situation. Your \(f_r\), as you might have guessed, can also be a neural network, which is often done in practice.
The ideas and intuitions we just talked about create the Message Passing Neural Networks (MPNNs), one of the most potent graph neural networks first proposed in Neural Message Passing for Quantum Chemistry (Gilmer et al. 2017).
It now seems like we have indeed created a general graph neural network. But you can see that our message network requires \(e_{ij}\), the edge properties, which you randomly initialize at the start just like the node values.
But while the node values get updated at each step, the edge values, also initialized by you, are never changed. So we need to try and generalize this as well, as an extension to what we just saw.
Understanding how the node updates work, I think you can very easily apply something similar for an edge update function as well.
\(U_{edge}\) is another standard neural network:
\[e_{ij}^{\prime} = U_{edge}(e_{ij}, x_i, x_j)\]
Something you could also do with this framework: the outputs of \(U_{edge}\) are already edge-level properties, so why not just use them as the messages? Well, you could do that as well.
Message Passing Neural Networks (MPNN) are the most general graph neural network layers. But this does require storage and manipulation of edge messages as well as the node features.
This can get a bit troublesome in terms of memory and representation, so MPNNs sometimes suffer from scalability issues and in practice are applied to small-sized graphs.
As Petar Veličković says, "MPNNs are the MLPs of the graph domain". We will be looking at some extensions of MPNNs as well as how to implement an MPNN in code.
You can quite easily apply exactly what we talked about in either PyTorch or TensorFlow but try doing so and you will see that this just blows up the memory.
Usually what we do with standard neural networks is work on batches of data. So you usually pass in an input array of shape [batch size, # of input neurons] to the neural network to make it work efficiently.
Now our number of input neurons is not fixed, as highlighted earlier. And yes, convolutional neural networks do deal with arbitrarily sized images, but when you think in terms of batches, you need all the images in a batch to have the same dimensions.
There are multiple things you could do:
In this section I will give you an overview of some other widely used graph neural network layers.
We won't be looking at the intuition behind any of these layers and how each part pieces together in the update function. Instead I'll just give you a high level overview of these methods. You could most certainly read the original papers to get a better understanding.
One of the most popular GNN architectures is Graph Convolutional Networks (GCN) by Kipf et al. which is essentially a spectral method.
Spectral methods work with the representation of a graph in the spectral domain. Spectral here means that we will utilize the Laplacian eigenvectors.
GCNs are based on top of ChebNets, which propose that the feature representation of any node should be affected only by its k-hop neighborhood, and which compute the convolution using Chebyshev polynomials.
In a GCN this is simplified to \(K=1\). We will start off by defining a degree matrix (the row-wise sum of the adjacency matrix \(\tilde{A}\), where \(\tilde{A} = A + I\) includes self-loops):
\[\tilde{D}_{ii}=\sum_j\tilde{A}_{ij}\]
After applying a symmetric normalization, the graph convolutional network update rule can be written as follows, where H is the feature matrix and W is the trainable weight matrix:
\[H^{\prime}=\sigma(\tilde{D}^{-1/2} \tilde{A}\tilde{D}^{-1/2} HW)\]
Node-wise, you can write this as follows, where the sum runs over the neighborhood of node i and \(N_i\), \(N_j\) are the sizes of the node neighborhoods:
\[\vec{h_i^{\prime}} = \sigma\left(\sum_{j \in N_i} \frac{1}{\sqrt{N_iN_j}} W \vec{h_j} \right)\]
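To make the normalization concrete, here is a minimal NumPy sketch of the propagation rule (before the nonlinearity \(\sigma\)) on a toy 3-node path graph; the feature and weight sizes are arbitrary choices for illustration:

```python
import numpy as np

# Toy undirected graph on 3 nodes with edges 0-1 and 1-2
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])

A_tilde = A + np.eye(3)            # adjacency with self-loops added
deg = A_tilde.sum(axis=1)          # row-wise sums give the degrees
D_inv_sqrt = np.diag(deg ** -0.5)  # the D^(-1/2) normalization matrix

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))        # node feature matrix (3 nodes, 4 features)
W = rng.normal(size=(4, 2))        # trainable weight matrix (4 -> 2)

# symmetric-normalized GCN propagation, before the nonlinearity
H_new = D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W
```

Note that the normalized adjacency \(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\) is symmetric, which is exactly why the layer treats edges in both directions the same way.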
Of course, with GCNs you no longer have edge features, or the idea that a node can send an arbitrary value across the graph, which we had with the MPNNs we discussed earlier.
Recall the node-wise update rule in GCN we just saw: the coefficient \(\frac{1}{\sqrt{N_iN_j}}\) is derived from the degree matrix of the graph.
In Graph Attention Networks (GAT) by Veličković et al., this coefficient \(\alpha_{ij}\) is instead computed implicitly. For a particular edge you take the features of the sender node, the receiver node, and the edge features, and pass them through an attention function:
\[a_{ij}=a(\vec{h_i}, \vec{h_j}, \vec{e_{ij}})\]
\(a\) could be any learnable, shared self-attention mechanism, like in Transformers. The resulting scores can then be normalized with a softmax function across the neighborhood:
\[\alpha_{ij}=\frac{e^{a_{ij}}}{\sum_{k \in N_i} e^{a_{ik}}}\]
This constitutes the GAT update rule. The authors hypothesize that this could be significantly stabilized with multi-head self-attention. Here is a visualization by the paper's authors showing a step of the GAT.
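The attention coefficients above can be sketched for a single head and a single node in plain NumPy. The concatenation-based attention vector and LeakyReLU slope follow the GAT paper, but the sizes and the sample neighborhood here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

F_in, F_out = 4, 3
W = rng.normal(size=(F_in, F_out))  # shared linear transform
a = rng.normal(size=(2 * F_out,))   # attention vector for one head

h = rng.normal(size=(5, F_in))      # features for 5 nodes
neighbors = [1, 2, 4]               # hypothetical neighborhood of node 0

# unnormalized attention logit a_0j for each neighbor j of node 0
logits = np.array([
    leaky_relu(a @ np.concatenate([h[0] @ W, h[j] @ W]))
    for j in neighbors
])
alpha = softmax(logits)             # normalized coefficients, sum to 1

# aggregated (pre-activation) update for node 0
h0_new = sum(a_j * (h[j] @ W) for a_j, j in zip(alpha, neighbors))
```

Multi-head attention would simply run several independent copies of `W` and `a` and concatenate (or average) the resulting `h0_new` vectors.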
This method is also very scalable because it only has to compute a scalar for the influence from node i to node j, and not a vector as in MPNNs. It is probably not as general as MPNNs, though.
With multiple frameworks like PyTorch Geometric, TF-GNN, Spektral (based on TensorFlow), and more, it is indeed quite simple to implement graph neural networks. We will see a couple of examples here, starting with MPNNs.
Here is how you create a message passing neural network similar to the one in the original paper "Neural Message Passing for Quantum Chemistry" with PyTorch Geometric:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch_geometric.transforms as T
from torch_geometric.utils import normalized_cut
from torch_geometric.nn import NNConv, global_mean_pool, graclus, max_pool, max_pool_x

def normalized_cut_2d(edge_index, pos):
    row, col = edge_index
    edge_attr = torch.norm(pos[row] - pos[col], p=2, dim=1)
    return normalized_cut(edge_index, edge_attr, num_nodes=pos.size(0))
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        nn1 = nn.Sequential(
            nn.Linear(2, 25), nn.ReLU(), nn.Linear(25, d.num_features * 32)
        )
        self.conv1 = NNConv(d.num_features, 32, nn1, aggr="mean")
        nn2 = nn.Sequential(nn.Linear(2, 25), nn.ReLU(), nn.Linear(25, 32 * 64))
        self.conv2 = NNConv(32, 64, nn2, aggr="mean")
        self.fc1 = torch.nn.Linear(64, 128)
        self.fc2 = torch.nn.Linear(128, d.num_classes)

    def forward(self, data):
        data.x = F.elu(self.conv1(data.x, data.edge_index, data.edge_attr))
        weight = normalized_cut_2d(data.edge_index, data.pos)
        cluster = graclus(data.edge_index, weight, data.x.size(0))
        data.edge_attr = None
        data = max_pool(cluster, data, transform=transform)
        data.x = F.elu(self.conv2(data.x, data.edge_index, data.edge_attr))
        weight = normalized_cut_2d(data.edge_index, data.pos)
        cluster = graclus(data.edge_index, weight, data.x.size(0))
        x, batch = max_pool_x(cluster, data.x, data.batch)
        x = global_mean_pool(x, batch)
        x = F.elu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        return F.log_softmax(self.fc2(x), dim=1)
You can find a complete Colab Notebook demonstrating the implementation here; it is indeed quite an involved example. It is quite simple to implement this in TensorFlow as well, and you can find a full-length tutorial on Keras Examples here.
Implementing a GCN is also quite simple with PyTorch Geometric. You can easily implement it with TensorFlow as well, and you can find a complete Colab Notebook here.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv

class Net(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = GCNConv(dataset.num_features, 16, cached=True,
                             normalize=not args.use_gdc)
        self.conv2 = GCNConv(16, dataset.num_classes, cached=True,
                             normalize=not args.use_gdc)

    def forward(self):
        x, edge_index, edge_weight = data.x, data.edge_index, data.edge_attr
        x = F.relu(self.conv1(x, edge_index, edge_weight))
        x = F.dropout(x, training=self.training)
        x = self.conv2(x, edge_index, edge_weight)
        return F.log_softmax(x, dim=1)
And now let's try implementing a GAT. You can find the complete Colab Notebook here.
import torch
import torch.nn.functional as F
from torch_geometric.nn import GATConv

class Net(torch.nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.conv1 = GATConv(in_channels, 8, heads=8, dropout=0.6)
        # On the Pubmed dataset, use heads=8 in conv2.
        self.conv2 = GATConv(8 * 8, out_channels, heads=1, concat=False,
                             dropout=0.6)

    def forward(self, x, edge_index):
        x = F.dropout(x, p=0.6, training=self.training)
        x = F.elu(self.conv1(x, edge_index))
        x = F.dropout(x, p=0.6, training=self.training)
        x = self.conv2(x, edge_index)
        return F.log_softmax(x, dim=1)
Thank you for sticking with me until the end. I hope that you've taken away a thing or two about graph neural networks and enjoyed reading through how these intuitions for graph neural networks form in the first place.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!
Lastly, for the motivated reader, among others I would also encourage you to read the original paper "The Graph Neural Network Model" where the GNN was first proposed, as it is really interesting. An open-access archive of the paper can be found here. This article also takes inspiration from Theoretical Foundations of Graph Neural Networks and CS224W, which I suggest you check out.
You can also find me on Twitter @rishit_dagli, where I tweet about machine learning, and a bit of Android.
Developers around the world use GitHub to share their projects with the global developer community.
In this article, I'll give you some opinionated tips you can start using to help you build a great open-source project. You can also use these tips for building hackathon projects.
Quite recently I ended up on the coveted GitHub Trending page. I was the #2 trending developer on all of GitHub and for Python as well, which was a pleasant surprise for me on the morning of 7 September. And it was based on code I wrote at 4 am.
I was also featured on the GitHub daily newsletter, all after opensourcing one of my projects.
I will be sharing some tips in this article which you should be able to apply to all kinds of projects and not just Python packages like my own. You can check out my repo here.
It is almost impossible to game the GitHub Trending section:
GitHub's definition (of trending) takes into account a longer-term definition of trending and uses more complex measurements than the sheer number of stars, which helps to keep people from farming the system.
Founders often create startups based on problems they have personally encountered. With open-sourced code, you will likely be trying to solve problems that developers commonly have.
And since gaming the GitHub Trending section is almost impossible, you need a strong motivation: a big, common developer problem to work on. So how do you stumble onto a developer problem?
Well, for starters you can participate in hackathons, build projects, and experiment with other projects. And you will soon find something which could be made into a library, something you could make a utility out of, and so on.
Your motivation for building your project could come from anywhere. In my case, I explore new Machine Learning papers daily on arXiv (an openaccess archive for papers) and read the ones I find interesting. One such paper I read motivated me to build my Python package.
Another time, I was in a hackathon training a Machine Learning model and wanted to participate in other festivities. Our team then decided to build another opensource project called TFWatcher.
So you see, you'll likely find all sorts of issues you can work on when you're building a project.
And just to note when I say you should have a strong motivation, I do not mean the project should be really huge or really complex. It could certainly be a simple project that could make developers' lives easier.
Think about it this way: if there was a project like the one you want to develop, would you use it? If the answer is yes, you have enough motivation to build the project, regardless of the size or complexity.
"A man in Oakland, California, disrupted web development around the world last week by deleting 11 lines of code." (Keith Collins in "How one programmer broke the internet by deleting a tiny piece of code")
You might know about left-pad, a very small open-source npm package with just 11 lines of straightforward code. But it was used by tons of developers around the world, which just reinforces what I was talking about above.
Once you find a developer problem you want to solve and have enough motivation to start working on it, you'll ideally want to spend quite a bit of time doing your research.
I believe it is a good practice to try and answer these questions through your research:
If it has not been done yet, and there's a need for it, go ahead and start building it.
If something similar exists, is well developed, and is heavily used too, you might want to move on.
There are a huge number of opensource projects out there already, and it is quite common to find a repository doing similar stuff (more common than you would think). But you can still work on your project and make it better.
If something similar exists, your goals could be to make it more modular or more efficient. You could try implementing it in some other language or improve it in any other number of ways.
A great way to do so is to take a look at the issues for the existing repository. Try doing your research with existing solutions (if there are any) and find out what aspect of the project could possibly be improved. Your work could even be a derivative of the other project.
In my case, as I mentioned, I took inspiration from an interesting research paper I read (Fastformer: Additive Attention Can Be All You Need). I also discovered an official code implementation and a community implementation of the paper, both in PyTorch.
In the end, my repository, though a derivative of the research papers, was quite different from existing code implementations.
ELI5, or explain it like I'm five, is a great exercise that I like doing as soon as I have an idea for a repository.
I try to explain what the project aims to achieve or how it works or why it is better than similar repositories to a friend who doesn't know a lot about the subject. Often with some helpful analogies.
By doing this, it helps me develop or get a clear understanding of what I want to do in my projects. Trying to explain the project to a friend often also helps me spot any flaws in my plan or assumptions I may have made while ideating on the project.
This process really helps me when I start the development phase of the project. This is also when I start creating a project board. You can make a project board on GitHub itself, or you can use Trello, JetBrains Spaces, and so on.
I love using GitHub Projects and Issue checklists at this stage to help me manage, prioritize, and have a clear highlevel idea of what I need to do.
You can often take inspiration and learn from repositories belonging to similar categories. Check out how their code is structured. You should try scouting other great repositories and read through their wellwritten code.
In my case, I really loved the way reformer-pytorch was written. It's easy to use in your projects as a Python library, it expects you to ideally care about only a single class abstracting away a lot of the model-building process, and it returns an instance of torch.nn.Module (in PyTorch, the base class for all neural network modules) which you can pretty much do anything with. I ended up building FastTransformer in quite a similar way.
You might have heard the popular quote:
Any fool can write code that a computer can understand. Good programmers write code that humans can understand. Martin Fowler
If people can not understand your code, they will not use it.
I've noticed in my own repositories that, when I spend more time trying to make things simple and easy to use, those projects end up getting more traction. So try to spend the extra time making your project more usable and intuitive.
As a general rule of thumb while developing your repo:
A good README is undoubtedly one of the most important components of the repository. It's displayed on the repository's homepage.
Potential contributors usually first check out the README, and only then if they find it interesting will they look at the code or even consider using the project.
Also, this is not a definitive guide to writing a README. Just play around with it and experiment with what works for your project.
Generally, you'll want to include these components in your README:
Try to describe the project in just 3-4 lines. Don't worry about including too many minute details or features; you can add them in later sections. This is also the first thing visitors to your repository will read, so be sure to make it interesting.
If you have a logo or a cover image for your project, include it here. It helps contributors to have some sort of visual.
You will often see small badges at the top of the README that convey metadata, such as whether or not all the tests are passing for the project.
You can use shields.io to add some to your README. These will often give your project a lot of credibility without visitors having to go through all of your code.
You should always try to include visuals in your README. These could be a gif showing your project in action or a screenshot of your project.
Good graphics in the README can really help convince other developers to use your project.
You should also include specific installation guidelines. Include all the required dependencies and anything else other devs need to install in order to use your project.
If you faced any problems while setting up your project or installing a dependency, it is quite likely users will face them too, so make sure you talk about it.
These can be super straightforward:
I think this is super important to have in your README. You should not expect other developers to do a lot of homework or read your code; make it as easy as you can for them.
Always make sure and doublecheck that your code examples or a "How To" section are easily reproducible. Also, make sure it's understandable for a wide range of users, and don't leave out any required instructions for reproducing it.
Since my project is a Python package, I created an accompanying Colab Notebook to demonstrate the use of the package. This lets people easily try it out on their browsers without having to install anything on their own machines.
There are quite a few products that let you do this like repl.it, Glitch, Codepen, and so on.
It is often helpful to list out features your project has and problems it can help solve. You don't have to cover all the features you have worked on, but share the main ones.
This will help developers understand what your project can do and why they should use it.
Finally, you should clearly state if you are open to contributions and what your requirements are for accepting them. You should also document commands to lint the code or run tests.
These steps help to ensure high code quality and reduce the likelihood that the changes inadvertently break something.
Suppose you think your README gets too long. In that case, you can create an additional documentation website and link to it in the README rather than omitting any important information.
Since I work a lot with Python, I generally use Sphinx to generate my documentation from Python docstrings. I find Sphinx quite flexible and easy to set up.
There are quite a lot of options for generating your documentation out there: MkDocs, Docusaurus, docsify, and more. For my project which started trending, however, I didn't need an external documentation website.
Here is an example of what I feel might be a good start to a README from one of my own projects. This is not the complete README, but just how much I could get in a single image:
Finally, for more inspiration, I would suggest that you try using the Make a README guide.
Once you have created a beautiful README and made the project public, you need to think about bringing people to the GitHub page.
First, make sure you add relevant tags to your GitHub repo. These tags will make it much easier for the project to get discovered by people exploring GitHub.
Great places to post about your project are Hacker News and Reddit. Just keep in mind that getting your post to be the top story or the top post on either of these platforms is a difficult task.
When one of my repositories became the top story, it got more than a hundred stars in a couple of hours.
But when I originally posted my repo on Hacker News, I didn't have a single upvote. It wasn't until someone else from the community noticed my project and posted it on Hacker News that it went on to be the top story. So it often takes a good amount of planning and a little help from your friends to get your project to the top.
In my case, Twitter was a really great place to get the very first visitors for my project and reach out to an external audience.
This often serves as a great way to allow people to quickly see if they might be interested in checking out your project. And you just have a limited number of characters to sell your repository to people.
Also, make sure that you don't overdo posting about your project on any platform, because it might get flagged as spam.
I often get emails or messages about shouting out a project or a book. But I strongly believe that this does not make a lot of sense.
If you want someone to shout out your project, then you would want them to have used your project and found it useful before shouting it out. So an easier way to do so is by just adding a "tweet this project" button to the README. Then people who like the project can naturally shout it out.
Also, keep in mind that shoutouts don't directly bring you stars. People coming from shoutouts will only star your project if they like it.
This most certainly does not mean that you should not ask people for help or feedback or code reviews. Indeed, you should always try to address all kinds of feedback: improvements, bugs, inconsistencies, and so on.
Always be ready to embrace negative feedback and think about how you can improve on it. You might end up learning a couple new things :)
In my case, I noticed an unusually high number of visitors coming to the project from Twitter. My project was implemented with TensorFlow and Keras, and a couple of days later (after I got shouted out) I discovered that the creator of Keras himself shouted out my project!
This was most probably because I added a "Tweet this project" at the top of my README and let the discoveries come in on their own.
As an interesting exercise, I tried asking the community on Twitter for any tips they might have for me to include in this blog. Here are a few of them:
"(You should add) 1. Documentation 2. Decision logs" (Shreyas Patil)
"Good issue and PR templates. You could also use the GitHub issue forms." (Burhanuddin Rangwala)
"Documentation (should be complete and) include app architecture, packaging, style guide, coding conventions, links to the technology/framework/library used, prerequisite tokens, etc." (Chetan Gupta)
Thank you for sticking with me until the end. I hope that you take back a thing or two for your own opensource projects and build better ones. By incorporating these tips and experimenting and by putting in a lot of hard work you can create a great project.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!
Many thanks to Santosh Yadav and Shreyas Patil for helping me make this article better.
You can also find me on Twitter @rishit_dagli, where I tweet about open-source, machine learning, and a bit of Android.
What if you could monitor your Colab, Kaggle, or AzureML Machine Learning projects on your mobile phone? You'd be able to check in on your models on the fly even while taking a walk🚶.
If you are an ML developer, you know how training models can easily take a long time. How cool would it be to monitor this from your mobile phones?
Well, you can do it and in <5 lines of code.
Before getting on with the tutorial and showing you how it works, let me briefly describe what you can do with TF Watcher, an opensource project we will use to monitor our ML jobs:
Let's now go through the tutorial on how to monitor your models on a mobile device with Google Colab. I'll show you how to use this tool in Google Colab, so anyone can try it out, but you can pretty much replicate this anywhere (even on your local machine).
Feel free to follow along with this colab notebook.
To monitor your Machine Learning jobs on mobile devices, you need to install the tfwatcher Python package. This is an open-source Python package that I built, and you can find the source code in this GitHub repo.
To install the Python package from PyPI, run the following command in your notebook cell:
!pip install tfwatcher
For the purposes of this example, we will see how you can monitor a training job but you can use this package to monitor your evaluation or prediction jobs too. You will soon also see how you can easily specify the metrics you want to monitor.
In this example, we'll use MNIST, a simple dataset of 60,000 grayscale images of handwritten digits. We start by loading the dataset, and then we'll do some simple preprocessing to further speed up our example.
However, you can use everything we talk about in this article in your more complex experiments.
Let's fetch the dataset:
import tensorflow as tf
# Load example MNIST data and preprocess it
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype("float32") / 255.0
x_test = x_test.reshape(-1, 784).astype("float32") / 255.0
# Limit the data to 1000 samples to make it faster
x_train = x_train[:1000]
y_train = y_train[:1000]
x_test = x_test[:1000]
y_test = y_test[:1000]
Now we'll create a simple neural network that just has a single Dense layer. I'll be showing you how to use this with TensorFlow's Sequential API, but this works in exactly the same way with the Functional API or subclassed models, too.
from tensorflow import keras

# Define the Keras model
def get_model():
    model = keras.Sequential()
    model.add(keras.layers.Dense(1, input_dim=784))
    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=0.1),
        loss="mean_squared_error",
        metrics=["accuracy"],
    )
    return model
You might have noticed that while compiling our model, we also specified metrics, which lets us choose which metrics we want to monitor.
Here I mention "accuracy", so I should be able to monitor accuracy on my mobile device. By default, loss is logged, so in this case we would be monitoring 2 metrics: loss and accuracy.
You can add as many metrics as you need. You can also use TensorFlow's builtin metrics or add your own custom metric, too.
You will now import TF Watcher and create an instance of one of its classes:
import tfwatcher
MonitorCallback = tfwatcher.callbacks.EpochEnd(schedule = 1)
In this example, we use the EpochEnd class from TF Watcher to specify that we are interested in operating at the epoch level. There are quite a few of these classes which you can use for your own needs; find out all about the other classes in the documentation. We also pass schedule as 1 to monitor after every epoch. You could pass in 3 instead (to monitor after every 3 epochs), or you could also pass in a list of the specific epoch numbers you want to monitor.
When you run this piece of code, you should see something like this printed out:
This includes a unique 7-character ID for your session. Be sure to make a note of this ID, as you will use it to monitor your model.
Now we will train the model we built and monitor the real-time training metrics on a mobile device.
model = get_model()
history = model.fit(
    x_train,
    y_train,
    batch_size=128,
    epochs=100,
    validation_split=0.5,
    callbacks=[MonitorCallback],
)
In this piece of code, we start training our model for 100 epochs (which should be pretty quick in this case). We also add the object we made in the earlier step as a callback. If in your case you are monitoring prediction instead of training, you would add callbacks=[MonitorCallback] in the predict method instead.
Once you run the above piece of code, you can start monitoring it from the web app from your mobile device.
Go to https://www.tfwatcher.tech/ and enter the unique ID you created above. This is a PWA, which means you can also install it on your mobile device and use it like a native Android app, too.
Once you add your session ID, you should be able to see your logs progressing in real time through the charts. Apart from the metrics, you should also be able to see the time each epoch took. In other cases this might be the time taken for a batch, too.
Since ML is highly collaborative, you might want to share your live dashboards with colleagues. To do so, just click the share link button and the app creates a shareable link for anyone to view your live progress or stored dashboards.
Here is the shareable link for the dashboard I created in this tutorial.
Though the example I just showed looked quite cool, there is a lot more we can do with this tool. Now I will briefly talk about two of those scenarios: Distributed Training and noneager execution.
You might often distribute your Machine Learning training across multiple GPUs, multiple machines, or TPUs, probably with TensorFlow's tf.distribute.Strategy API.
You can use TF Watcher in the exact same way with most distribution strategies, with limited support when using ParameterServerStrategy in a custom training loop.
You can find some great examples of how to use these strategies with TensorFlow Keras here.
In TensorFlow 2, eager execution is turned on by default. But you will often want to use tf.function
to make graphs out of your programs. It's a transformation tool that creates Python-independent dataflow graphs out of your Python code.
One of the earliest versions of this project used some NumPy calls, but you can now use the package in the same way in non-eager mode too.
Thank you for sticking with me until the end. You can now monitor your Machine Learning projects from anywhere on your mobile device and take them to the next level. I hope you are as excited to start using this as I was.
If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!
You can also find me on Twitter @rishit_dagli.
Let's first talk a bit about density curves, as skewness and kurtosis are based on them. They're simply a way for us to represent a distribution. Let's see what I mean through an example.
Say that you need to record the heights of a lot of people. So your distribution has, let's say, 20 categories representing the range of the output (58-59 in, 59-60 in, ..., 78-79 in). You can plot a histogram representing these categories and the number of people whose height falls in each category.
Histogram of height vs population
Well, you might do this for thousands of people, so you are not interested in the exact number but rather the percentage or probability of each category.
I also explicitly mentioned that you have a rather large distribution since percentages are often useless for smaller distributions.
If you use percentages with smaller numbers, I often refer to it as "lying with statistics": a statement that is technically correct but creates the wrong impression in our minds.
Let me give you an example: a student is extremely excited and tells everyone in his class that he made a 100% improvement in his marks! But what he doesn't say is that his marks went from a 2/30 to 4/30 😂.
I hope you now clearly see the problem of using percentages with smaller numbers.
Coming back to density curves, when you are working with a large distribution you want to have more granular categories. So you make each category which was 1 inch wide now 2 categories each \(\frac{1}{2}\) inch wide. Maybe you want to get even more granular and start using \(\frac{1}{4}\) inch wide categories. Can you guess where I am going with this?
At a point, we get an infinite number of such categories with an infinitely small length. This allows us to create a curve from this histogram which we had earlier divided into discrete categories. See our density curve below drawn from the histogram.
Probability density curve for our distribution
Great question! As you may have guessed, I like to explain myself with examples, so let's look at another density curve to make it a bit easier for us to understand. Feel free to skip the curve equation at this stage if you have not worked with distributions before.
You can also follow along and create the graphs and visualizations in this article yourself through this Geogebra project (it runs in the browser).
\[ f(x) = \frac{1}{0.4 \sqrt{2 \pi} } \cdot e^{-\frac{1}{2} \left(\frac{x - 1.6}{0.4}\right)^2} \]
So now what if I ask you "What percent of my distribution is in the category 1 to 1.6?" Well, you just calculate the area under the curve between 1 and 1.6, like this:
\[ \int_{1}^{1.6} f(x) \,dx \]
It would also be relatively easy for you to answer similar questions from the density curve like: "What percent of the distribution is under 1.2?" or "What percent of the distribution is above 1.2?"
You can now probably see why the effort of making a density curve is worth it and how it allows you to make inferences easily 🚀.
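As a quick numerical check, the density curve above is the normal curve with mean 1.6 and standard deviation 0.4 (read off from its equation), so we can evaluate that integral exactly with the closed-form normal CDF instead of numerical integration:

```python
import math

def normal_cdf(x, mu=1.6, sigma=0.4):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# fraction of the distribution in the category 1 to 1.6
area = normal_cdf(1.6) - normal_cdf(1.0)
print(round(area, 4))  # 0.4332
```

So about 43% of the distribution falls between 1 and 1.6; the questions "what percent is under 1.2?" or "above 1.2?" reduce to single CDF evaluations in the same way.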
Let's now talk a bit about skewed distributions that is, those that are not as pleasant and symmetric as the curves we saw earlier. We'll talk about this more intuitively using the ideas of mean and median.
From this density curve graph's image, try figuring out where the median of this distribution would be. Perhaps it was easy for you to figure out the curve is symmetrical and you might have concluded that the median is 1.6 since it was symmetric about \(x=1.6\).
Another way to go about this would be to say that the median is the value where the area under the curve to the left of it and the area under the curve to the right of it are equal.
We're talking about this idea since it allows us to also calculate the median for nonsymmetric density curves.
As an example here, I show two very common skewed distributions and how the idea of equal areas we just discussed helps us find their medians. If we tried eyeballing our median, this is what we'd get since we want the areas on either side to be equal.
Eyeballing the median for skewed curves
You can also calculate the mean through these density curves. Maybe you've tried calculating the mean yourself already, but notice that if you use the general formula to calculate the mean:
\[ mean = \frac{\sum a_n}{n} \]
you might notice a flaw in it: we take into account the \( x \) values but we also have probabilities associated with these values too. And it just makes sense to factor that in too.
So we modify the way we calculate the mean by using a weighted average. We will now also have a term \(w_n\) representing the associated weights:
\[ mean = \frac{\sum{a_n \cdot w_n}}{\sum{w_n}} \]
So, we will be using the idea we just discussed to calculate the mean from our density curve.
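Here is a minimal sketch of that weighted-average calculation; the category values and their probabilities are hypothetical, standing in for a discretized density curve:

```python
# Hypothetical discretized categories and their probabilities (weights)
values  = [1.0, 1.5, 2.0, 2.5]
weights = [0.1, 0.3, 0.4, 0.2]  # probabilities, summing to 1

# weighted mean: each value contributes in proportion to its probability
mean = sum(a * w for a, w in zip(values, weights)) / sum(weights)
print(round(mean, 2))  # 1.85
```

Note that when the weights are probabilities summing to 1, the denominator is just 1 and the weighted mean is the familiar expected value.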
You can also more intuitively understand this as the point on the x-axis where you could place a fulcrum and balance the curve if it were a solid object. This idea should help you better understand finding the mean from our density curve.
Another really interesting way to look at this would be as the point on the x-axis about which the net torque on the curve would be zero.
You might have already figured out how we can locate the mean for symmetric curves: our median and mean lie at the same point, the point of symmetry.
We will be using the idea we just discussed, placing a fulcrum on the x-axis and balancing the curve, to eyeball the mean for skewed graphs like the ones we saw earlier while calculating the median.
We will soon discuss the idea of skewness in greater detail. But at this stage, generally speaking, you can identify the direction in which your curve is skewed: if the median is to the right of the mean, it is negatively skewed, and if the mean is to the right of the median, it is positively skewed.
Later in this article, for simplicity's sake we'll also refer to the narrow part of these curves as a "tail".
Before we talk more about skewness and kurtosis let's explore the idea of moments a bit. Later we'll use this concept to develop an idea for measuring skewness and kurtosis in our distribution.
We'll use a small dataset, [1, 2, 3, 3, 3, 6]. These numbers mean that you have points that are 1 unit away from the origin, 2 units away from the origin, and so on.
So, we care a lot about the distances from the origin in our dataset. We can represent the average distance from the origin in our data by writing:
\[ \frac{\sum (a_n - 0)}{n} = \frac{\sum a_n}{n} \]
This is what we call our first moment. Calculating this for our sample dataset we get 3 but if we change our dataset and make all elements equal to 3,
\[ [1, 2, 3, 3, 3, 6] \rightarrow [3, 3, 3, 3, 3, 3] \]
you'll see that our first moment remains the same. Can we devise something to differentiate our two datasets that have equal first moments? (PS: It's the second moment.)
We will calculate the average sum of squared distances rather than the average sum of distances:
\[ \frac{\sum (a_n)^2}{n} \]
Our second moment for our original dataset is 11.33 and for our new dataset is 9. Notice that the magnitude of the second moment is larger for our original dataset than the new one. Also, we have a higher value for the second moment in the original dataset because it is spread out and has a greater average squared distance.
Essentially we are saying that we have a couple of values in our original dataset larger than the mean value, which, when squared, increases our second moment by a lot.
Here's an interesting way of thinking about moments: treat our distribution as a distribution of mass. Then the first moment would be the center of mass, and the second moment would be the rotational inertia.
You can also see that our second moment is highly dependent on our first moment. But we are interested in knowing the information the second moment can give us independently.
To do so we calculate the squared distances from the mean or the first moment rather than from the origin.
\[ \frac{\sum (a_n - \mu_{1}')^2 }{n} \]
Did you notice that we also intuitively derived a formula for variance? Going forward you will see how we use the ideas we just talked about to measure skewness and kurtosis.
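The moment calculations above can be checked in a few lines of NumPy:

```python
import numpy as np

data = np.array([1, 2, 3, 3, 3, 6])

first_moment = data.mean()                      # average distance from the origin
second_moment = np.mean(data ** 2)              # average squared distance from the origin
variance = np.mean((data - first_moment) ** 2)  # second moment about the mean

print(first_moment)   # 3.0
print(second_moment)  # 11.333...
print(variance)       # 2.333...
```

Note that `variance` equals `second_moment - first_moment**2`, which is exactly the "information the second moment gives us independently of the first".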
Let's see how we can use the idea of moments we talked about earlier to figure out how we can measure skewness (which you already have some idea about) and kurtosis.
Let's take the idea of moments we talked about just now and try to calculate the third moment. As you might have guessed, we can calculate the cubes of our distances. But as we discussed above, we are more interested in seeing the additional information the third moment provides.
So we measure the third moment about the mean and scale it by the appropriate power of the standard deviation. Later, we will refer to this scaling as the adjustment to the moment. Our adjusted moment will look like this:
\[ \text{skewness} = \frac{\sum (a_n - \mu)^3 }{n \cdot \sigma ^3} \]
This adjusted moment is what we call skewness. It helps us measure the asymmetry in the data.
Perfectly symmetrical data would have a skewness value of 0. A negative skewness value implies that a distribution has its tail on the left side, while a positive skewness value implies the tail is on the right side of the distribution.
Positive skew and negative skew
At this stage, it might seem like calculating skewness would be pretty tough to do since in the formulas we use the population mean \( \mu \) and the population standard deviation \( \sigma \) which we wouldn't have access to while taking a sample.
Instead, you only have the sample mean and the sample standard deviation, so we will soon see how you can use these.
As you might have guessed, this time we will calculate our fourth moment, using the fourth power of our distances. And as we discussed earlier, we are interested in the additional information this provides, so we again measure it about the mean and apply the adjustment, this time dividing by \( \sigma^4 \).
This is what we call kurtosis: a measure of whether our data has a lot of outliers or very few outliers. It looks like:
\[ \text{kurtosis} = \frac{\sum (a_n - \mu)^4 }{n \cdot \sigma ^4} \]
A better way to describe what's going on here is figuring out whether the distribution is heavy-tailed or light-tailed. We can compare this to a normal distribution.
If you do a simple substitution you'll see that the kurtosis of the normal distribution is 3. And since we are interested in comparing kurtosis to the normal distribution, we often use excess kurtosis, which simply subtracts 3 from the above equation.
Positive and negative kurtosis (Adapted from Analytics Vidhya)
This is us essentially forcing the kurtosis of the normal distribution to be 0 for easier comparison. So, positive excess kurtosis indicates a heavy-tailed distribution, while negative excess kurtosis indicates a light-tailed distribution. Graphically, this would look something like the image above.
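You can verify this numerically: SciPy's `kurtosis` returns excess kurtosis by default (`fisher=True`), so a large normal sample should score near 0, and near 3 with `fisher=False`:

```python
import numpy as np
from scipy.stats import kurtosis, skew

# Large sample from a normal distribution (seeded for reproducibility).
rng = np.random.default_rng(0)
sample = rng.normal(loc=0.0, scale=1.0, size=100_000)

s = skew(sample)                      # close to 0: symmetric
k_excess = kurtosis(sample)           # excess kurtosis, close to 0
k_plain = kurtosis(sample, fisher=False)  # "plain" kurtosis, close to 3
print(s, k_excess, k_plain)
```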
So, a problem with the equations we just built is that they have two terms in them, the distribution mean \( \mu \) and the distribution standard deviation \( \sigma \). But we are taking a sample of observations so we do not have the parameters for the whole distribution. We'd only have the sample mean and the sample standard deviation.
To keep this article focused, we will not be talking in detail about sampling adjustment terms since degrees of freedom is not in the scope of this article.
The idea is to use our sample mean \( \bar{x} \) and our sample standard deviation \( s \) to estimate these values for our distribution. We will also have to adjust the degrees of freedom in these equations for it.
Don't worry if you don't understand this concept completely at this point. We can move on anyway. This leads to us modifying the equations we talked about earlier like so:
\[ \text{skewness} = \frac{\sum (a_n - \bar{x})^3 }{s^3} \cdot \frac{n}{(n-1)(n-2)} \]
\[ \text{kurtosis} = \frac{\sum (a_n - \bar{x})^4 }{s^4} \cdot \frac{n(n+1)}{(n-1)(n-2)(n-3)} - \frac{3(n-1)^2}{(n-2)(n-3)} \]
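As a sanity check, these adjusted formulas can be implemented directly and compared against SciPy's bias-corrected estimates (a sketch; the function names are mine):

```python
import numpy as np
from scipy.stats import skew, kurtosis

def sample_skewness(a):
    # Adjusted Fisher-Pearson skewness, using the sample mean and the
    # sample standard deviation s (ddof=1), as in the formula above.
    a = np.asarray(a, dtype=float)
    n = len(a)
    s = a.std(ddof=1)
    return np.sum((a - a.mean()) ** 3) / s ** 3 * n / ((n - 1) * (n - 2))

def sample_excess_kurtosis(a):
    a = np.asarray(a, dtype=float)
    n = len(a)
    s = a.std(ddof=1)
    term = np.sum((a - a.mean()) ** 4) / s ** 4
    return (term * n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
            - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3)))

data = [1, 2, 3, 3, 3, 6]
# These should agree with scipy's bias-corrected estimates:
print(sample_skewness(data), skew(data, bias=False))
print(sample_excess_kurtosis(data), kurtosis(data, bias=False))
```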
Finally, let's finish up by seeing how you can measure skewness and kurtosis in Python with an example. In case you want to try out the code, you can follow along with this Colab Notebook where we measure the skewness and kurtosis of a dataset.
It is pretty straightforward to implement this in Python: SciPy has prebuilt methods to measure skewness and kurtosis for a distribution.
The code block below shows how to measure skewness and kurtosis for the Boston housing dataset, but you could also use it for your own distributions.
from scipy.stats import skew, kurtosis

# `data` is assumed to be a pandas DataFrame holding the Boston housing
# dataset, with median home values in the "MEDV" column.
print(skew(data["MEDV"].dropna()))
print(kurtosis(data["MEDV"].dropna()))
Thank you for sticking with me until the end. I hope you have learned a lot from this article.
I am excited to see if this article helped you better understand these two very important ideas. If you have any feedback or suggestions for me please feel free to reach out to me on Twitter.
In this article, I will share with you some useful tips and guidelines that you can use to build better deep learning models. These tricks should make it a lot easier for you to develop a good network.
You can pick and choose which tips you use, as some will be more helpful for the projects you are working on. Not everything mentioned in this article will directly improve your model's performance.
One of the more painful things about training Deep Neural Networks is the large number of hyperparameters you have to deal with.
These could be your learning rate \( \alpha \), the discounting factor \( \rho \), and epsilon \( \epsilon \) if you are using the RMSprop optimizer (Hinton et al.) or the exponential decay rates \( \beta_1 \) and \( \beta_2 \) if you are using the Adam optimizer (Kingma et al.).
You also need to choose the number of layers in the network or the number of hidden units for the layers. You might be using learning rate schedulers and would want to configure those features and a lot more 😩! We definitely need ways to better organize our hyperparameter tuning process.
A common algorithm I tend to use to organize my hyperparameter search process is Random Search. Though there are other algorithms that might be better, I usually end up using it anyway.
Let's say for the purpose of this example you want to tune two hyperparameters, and you suspect that the optimal values for both would be somewhere between one and five.
The idea here is that instead of systematically picking twenty-five values to try out, like (1, 1), (1, 2), and so on, it would be more effective to select twenty-five points at random.
Based on Lecture Notes of Andrew Ng
Here is a simple example with TensorFlow where I try to use Random Search on the Fashion MNIST Dataset for the learning rate and the number of units in the first Dense layer:
Here I suspect that an optimal number of units in the first Dense layer would be somewhere between 32 and 512, and my learning rate would be one of 1e-2, 1e-3, or 1e-4.
Consequently, as shown in this example, I set my minimum value for the number of units to be 32 and the maximum value to be 512 and have a step size of 32. Then, instead of hardcoding a value for the number of units, I specify a range to try out.
hp_units = hp.Int('units', min_value = 32, max_value = 512, step = 32)
model.add(tf.keras.layers.Dense(units = hp_units, activation = 'relu'))
We do the same for our learning rate, but our learning rate is simply one of 1e-2, 1e-3, or 1e-4 rather than a range.
hp_learning_rate = hp.Choice('learning_rate', values = [1e-2, 1e-3, 1e-4])
optimizer = tf.keras.optimizers.Adam(learning_rate = hp_learning_rate)
Finally, we perform Random Search and specify that among all the models we build, the model with the highest validation accuracy would be called the best model. Or simply that getting a good validation accuracy is the goal.
tuner = kt.RandomSearch(model_builder,
objective = 'val_accuracy',
max_trials = 10,
directory = 'random_search_starter',
project_name = 'intro_to_kt')
tuner.search(img_train, label_train, epochs = 10, validation_data = (img_test, label_test))
After doing so, I also want to retrieve the best model and the best hyperparameter choice, though I would like to point out that using get_best_models is usually considered a shortcut.
To get the best performance, you should retrain your model on the full dataset with the best hyperparameters you get.
# Which was the best model?
best_model = tuner.get_best_models(1)[0]
# What were the best hyperparameters?
best_hyperparameters = tuner.get_best_hyperparameters(1)[0]
I won't be talking about this code in detail in this article, but you can read about it in this article I wrote some time back if you want.
The bigger your neural network is, the more accurate your results (in general). As model sizes grow, the memory and compute requirements for training these models also increase.
The idea of Mixed Precision Training (NVIDIA, Micikevicius et al.) is to train deep neural networks using half-precision floating-point numbers, which lets you train large neural networks a lot faster with no or negligible decrease in the performance of the networks.
But I'd like to point out that this technique should only be used for large models, with more than 100 million parameters or so.
While mixedprecision would run on most hardware, it will only speed up models on recent NVIDIA GPUs (for example Tesla V100 and Tesla T4) and Cloud TPUs.
I want to give you an idea of the performance gains when using Mixed Precision. When I trained a ResNet model on my GCP Notebook instance (consisting of a Tesla V100), training was almost three times faster, and almost 1.5 times faster on a Cloud TPU instance, with almost no difference in accuracy. The code to measure the above speedups was taken from this example.
To further increase your training throughput, you could also consider using a larger batch size; since float16 tensors take half the memory, you should not run out of it.
It is also rather easy to implement Mixed Precision with TensorFlow: the tf.keras.mixed_precision module allows you to set up a dtype policy (to use float16) and also applies loss scaling to prevent underflow.
Here is a minimalistic example of using Mixed Precision Training on a network:
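The gist originally embedded here isn't shown, so below is a minimal sketch of what such an example can look like (the layer sizes are illustrative, and a real run would also call `model.fit`):

```python
import tensorflow as tf
from tensorflow.keras import layers, mixed_precision

# Set the global dtype policy: under 'mixed_float16', layers compute in
# float16 but keep their variables in float32 for numeric stability.
mixed_precision.set_global_policy('mixed_float16')

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),
    layers.Dense(4096, activation='relu'),
    layers.Dense(4096, activation='relu'),
    # Override the output layer to float32 to avoid numeric issues.
    layers.Dense(10, activation='softmax', dtype='float32'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

first_dtype = model.layers[0].compute_dtype  # 'float16'
last_dtype = model.layers[-1].compute_dtype  # 'float32'
print(first_dtype, last_dtype)

# Restore the default policy so the rest of a script is unaffected.
mixed_precision.set_global_policy('float32')
```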
In this example, we first set the dtype policy to float16, which implies that all of our model layers will automatically use float16.
After doing so we build a model, but we override the data type for the last (output) layer to be float32 to prevent any numeric issues. Ideally, your output layers should be float32.
Note: I've built a model with this many units so we can see some difference in the training time with Mixed Precision Training, since it works well for large models.
If you are looking for more inspiration to use Mixed Precision Training, here is an image demonstrating speedup for multiple models by Google Cloud on a TPU:
Speedups on a Cloud TPU
In multiple scenarios, I have had to implement a neural network from scratch. And implementing backpropagation is typically the aspect that's prone to mistakes and is also difficult to debug.
With incorrect backpropagation, your model could learn something that might look reasonable, which makes it even more difficult to debug. So, how cool would it be if we could implement something that allowed us to debug our neural nets easily?
I often use Gradient Check when implementing backpropagation to help me debug it. The idea here is to approximate the gradients using a numerical approach. If it is close to the calculated gradients by the backpropagation algorithm, then you can be more confident that the backpropagation was implemented correctly.
You can use the standard two-sided difference expression to get a vector we will call \( d\theta_{approx} \); each component approximates the partial derivative of the cost \( J \) with respect to one parameter:
\[ d\theta_{approx}[i] = \frac{J(\theta_1, \ldots, \theta_i + \varepsilon, \ldots) - J(\theta_1, \ldots, \theta_i - \varepsilon, \ldots)}{2\varepsilon} \]
In case you are looking for the reasoning behind this, you can find more about it in this article I wrote.
So, now we have two vectors \( d \theta [approx] \) and \( d \theta \) (calculated by backprop). And these should be almost equal to each other. You could simply compute the Euclidean distance between these two vectors and use this reference table to help you debug your nets:
A Reference table
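Here is a minimal NumPy sketch of the procedure on a toy cost function (the function names and the toy \( J \) are mine, not from the article; the thresholds in the comment are rough rules of thumb):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-7):
    # Two-sided difference approximation of the gradient of J at theta.
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps
        minus[i] -= eps
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    return approx

def gradient_check(J, grad, theta, eps=1e-7):
    # Relative difference between analytic (backprop) gradients and the
    # numerical ones; roughly, < 1e-7 is great, ~1e-5 is okay,
    # and > 1e-3 usually means a bug.
    approx = numerical_gradient(J, theta, eps)
    exact = grad(theta)
    return (np.linalg.norm(approx - exact)
            / (np.linalg.norm(approx) + np.linalg.norm(exact)))

# Toy example: J(theta) = sum(theta^2), whose gradient is 2 * theta.
theta = np.array([1.0, -2.0, 3.0])
diff = gradient_check(lambda t: np.sum(t ** 2), lambda t: 2 * t, theta)
print(diff)  # tiny: the "backprop" gradient matches the numerical one
```

Passing a deliberately wrong gradient (say `lambda t: 3 * t`) makes the relative difference jump far above 1e-3, which is exactly the signal you would use to catch a backprop bug.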
Caching datasets is a simple idea, but it's not one I have seen used much. The idea here is to go over the dataset in its entirety and cache it, either in a file or in memory (if it is a small dataset).
This should save you from performing some expensive CPU operations like file opening and data reading during every single epoch.
This also means that your first epoch will comparatively take more time 📉, since ideally you would perform all the operations like opening files and reading data in the first epoch and then cache the results. But the subsequent epochs should be a lot faster, since you would be using the cached data.
This seems like a very simple idea to implement, right? Here is an example with TensorFlow showing how you can very easily cache datasets. It also shows the speedup 🚀 from implementing this idea. Find the complete code for the below example in this gist of mine.
A simple example of caching datasets and the speedup with it
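Since the embedded example isn't shown here, this is a minimal sketch of the idea with tf.data; the `time.sleep` in `slow_parse` is an artificial stand-in for expensive per-element work like file reads:

```python
import time
import tensorflow as tf

def slow_parse(x):
    # Stand-in for an expensive CPU-side transformation (e.g. file I/O).
    time.sleep(0.01)
    return x * 2

ds = tf.data.Dataset.range(50).map(
    lambda x: tf.py_function(slow_parse, [x], tf.int64)
).cache()  # the first pass fills the cache; later passes read from it

start = time.perf_counter()
list(ds)  # epoch 1: slow, performs the expensive work
first_epoch = time.perf_counter() - start

start = time.perf_counter()
list(ds)  # epoch 2: served from the in-memory cache
second_epoch = time.perf_counter() - start
print(first_epoch, second_epoch)  # the second pass should be much faster
```

Calling `.cache("some_file")` instead would cache to disk, which is the variant you would use when the dataset does not fit in memory.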
When you're working with neural networks, overfitting and underfitting might be two of the most common problems you face. This section talks about some common approaches that I use when tackling these problems.
You might know this, but high bias will cause you to miss a relationship between features and labels (underfitting) and high variance will cause the model to capture the noise and overfit to the training data.
I believe the most effective way to solve overfitting is to get more data though you could also augment your data. A benefit of deep neural networks is that their performance improves as they are fed more and more data.
But in a lot of situations, it might be too expensive to get more data, or it simply might not be possible to do so. In that case, let's talk about a couple of other methods you could use to tackle overfitting.
Apart from getting more data or augmenting your data, you could also tackle overfitting either by changing the architecture of the network or by applying some modifications to the network's weights. Let's look at these two methods.
A simple way to change the architecture such that it doesn't overfit would be to use Random Search to stumble upon a good architecture. Or you could try pruning nodes from your model, essentially lowering its capacity.
We already talked about Random Search, but in case you want to see an example of pruning you could take a look at the TensorFlow Model Optimization Pruning Guide.
In this section, we will see some methods I commonly use to prevent overfitting by modifying a networks weights.
Reiterating what we discussed, simpler models are less likely to overfit than complex ones. We can cap the complexity of the network by forcing its weights to take only small values.
To do so we will add to our loss function a term that can penalize our model if it has large weights. Often \( L_1 \) and \( L_2 \) regularizations are used, the difference being:
L1: the penalty added is proportional to the sum of the absolute values of the weight coefficients, \( \propto \sum |w| \)
L2: the penalty added is proportional to the sum of the squares of the weight coefficients, \( \propto \sum w^2 \)
where \( |w| \) denotes the absolute value of a weight.
Do you notice the difference between L1 and L2, the square term? Because of it, L1 can push weights to be exactly zero, whereas L2 drives weights toward zero without making them exactly zero.
In case you are curious about exploring this further, this article goes deep into regularizations and might help.
This is also the exact reason why I tend to use L2 more than L1 regularization. Let's see an example of this with TensorFlow.
Here I show some code to create a simple Dense layer with 3 units and the L2 regularization:
import tensorflow as tf
tf.keras.layers.Dense(3, kernel_regularizer = tf.keras.regularizers.L2(0.1))
To provide more clarity on what this does: as we discussed above, this adds a term \( 0.1 \times w^2 \) for every weight coefficient \( w \) to the loss function, which works as a penalty on very big weights. Also, it is as easy as replacing L2 with L1 in the above code to implement L1 for your layer.
The first thing I do when I am building a model and face overfitting is try using dropouts (Srivastava et al.). The idea here is to randomly drop out or set to zero (ignore) x% of output features of the layer during training.
We do this to stop individual nodes from relying too much on the output of other nodes and to prevent them from co-adapting.
Dropouts are rather easy to implement with TensorFlow since they are available as layers. Here is an example of me trying to build a model to differentiate images of dogs and cats with Dropout to reduce overfitting:
model = tf.keras.models.Sequential([
tf.keras.layers.Conv2D(32, (3,3), padding='same', activation='relu',input_shape=(IMG_HEIGHT, IMG_WIDTH ,3)),
tf.keras.layers.MaxPooling2D(2,2),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Conv2D(128, (3,3), padding='same', activation='relu'),
tf.keras.layers.MaxPooling2D(2,2),
tf.keras.layers.Dropout(0.2),
tf.keras.layers.Flatten(),
tf.keras.layers.Dense(512, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')
])
As you can see in the code above, you can directly use tf.keras.layers.Dropout to implement dropout, passing it the fraction of output features to ignore (here, 20% of the output features).
Early stopping is another regularization method I often use. The idea here is to monitor the performance of the model on a validation set at every epoch and terminate the training when some specified condition on the validation performance is met (for example, stop training when the loss drops below 0.5).
It turns out that the basic condition like we talked about above works like a charm if your training error and validation error look something like in this image. In this case, Early Stopping would just stop training when it reaches the red box (for demonstration) and would straight up prevent overfitting.
It (Early stopping) is such a simple and efficient regularization technique that Geoffrey Hinton called it a beautiful free lunch.
HandsOn Machine Learning with ScikitLearn and TensorFlow by Aurelien Geron
Adapted from Lutz Prechelt
However, for some cases, you would not end up with such straightforward choices for identifying the criterion or knowing when Early Stopping should stop training the model.
For the scope of this article, we will not be talking about more criteria here, but I would recommend that you check out "Early Stopping - But When?" by Lutz Prechelt, which I use a lot to help decide criteria.
Let's see an example of Early Stopping in action with TensorFlow:
import tensorflow as tf
callback = tf.keras.callbacks.EarlyStopping(monitor='loss', patience=3)
model = tf.keras.models.Sequential([...])
model.compile(...)
model.fit(..., callbacks = [callback])
In the above example, we create an Early Stopping Callback and specify that we want to monitor our loss values. We also specify that it should stop training if it does not see noticeable improvements in loss values in 3 epochs. Finally, while training the model, we specify that it should use this callback.
Also, for the purpose of this example, I show a Sequential model but this could work in the exact same manner with a model created with the functional API or subclassed models, too.
Thank you for sticking with me until the end. I hope you will benefit from this article and incorporate these tips in your own experiments.
I am excited to see if they help you improve the performance of your neural nets, too. If you have any feedback or suggestions for me please feel free to reach out to me on Twitter.
Hello developers 👋! If you have worked on building deep neural networks before, you might know that building neural nets can involve setting a lot of different hyperparameters. In this article, I will share with you some tips and guidelines you can use to better organize your hyperparameter tuning process, which should make it a lot more efficient for you to stumble upon a good setting for the hyperparameters.
Very simply, a hyperparameter is a parameter that is external to the model: it cannot be learned within the estimator, and you cannot calculate its value from the data.
Many models have important parameters which cannot be directly estimated from the data. This type of model parameter is referred to as a tuning parameter because there is no analytical formula available to calculate an appropriate value.
Page 64, 65, Applied Predictive Modeling, 2013
Hyperparameters are often used in processes that help estimate the model parameters, and they usually have to be specified by you. In most cases, to tune these hyperparameters you would use some heuristic approach based on your experience, perhaps starter values for the hyperparameters, or find the best values by trial and error for a given problem.
As I was saying at the start of this article, one of the painful things about training Deep Neural Networks is the large number of hyperparameters you have to deal with. These could be your learning rate \( \alpha \), the discounting factor \( \rho \), and epsilon \( \epsilon \) if you are using the RMSprop optimizer (Hinton et al.), or the exponential decay rates \( \beta_1 \) and \( \beta_2 \) if you are using the Adam optimizer (Kingma et al.). You also need to choose the number of layers in the network or the number of hidden units for the layers; you might be using learning rate schedulers and would want to configure those, and a lot more 😩! We definitely need ways to better organize our hyperparameter tuning process.
Usually, we can categorize hyperparameters into two groups: those used for training and those used for model design 🖌. A proper choice of the hyperparameters related to model training allows neural networks to learn faster and achieve better performance, so the tuning process is definitely something you want to care about. The hyperparameters for model design are more related to the structure of the neural network, a trivial example being the number of hidden layers and the width of these layers. The model training hyperparameters can in most cases serve as a way to measure a model's learning capacity 🧠.
During the training process, I usually give the most attention to the learning rate \( \alpha \) and the batch size, because these determine the speed of convergence, so you should consider tuning them first or giving them more attention. However, I strongly believe that for most models the learning rate is the most important hyperparameter to tune, consequently deserving more attention. We will discuss ways to select the learning rate later in this article. Also note that I say usually here; this could most certainly change according to the kind of applications you are building.
Next up, I usually consider tuning the momentum term \( \beta \) in RMSprop and others, since this helps us reduce oscillation by strengthening the weight updates in the same direction while damping the changes in differing directions. I often suggest using \( \beta = 0.9 \), which works as a very good default and is most often used too.
After doing so I would try and tune the number of hidden units for each layer followed by the number of hidden layers which essentially help change the model structure followed by the learning rate decay which we will soon see. Note: The order suggested in this paragraph has seemed to work well for me and was originally suggested by Andrew Ng.
Furthermore, a really helpful summary about the order of importance of hyperparameters from lecture notes of Ng was compiled by Tong Yu and Hong Zhu in their paper suggesting this order:
learning rate
Momentum \( \beta \), for RMSprop, etc.
Mini-batch size
Number of hidden layers
learning rate decay
Regularization
The things we talk about under this section would essentially be some things I find important and apply while the tuning of any hyperparameter. So, we will not be talking about things related to tuning for a specific hyperparameter but concepts that apply to all of them.
Random Search and Grid Search (a precursor of Random Search) are by far the most widely used methods because of their simplicity. Earlier, it was very common to sample points in a grid and then systematically perform an exhaustive search over the hyperparameter set specified by the user. This works well and is applicable for several hyperparameters with a limited search space. In the diagram here, Grid Search asks us to systematically sample the points and try out those values, after which we choose the one that best suits us.
To perform hyperparameter tuning for deep neural nets, it is often recommended to instead choose points at random. So in the image above, we choose the same number of points but do not follow a systematic approach to choosing them like on the left side. The reason you do this is that it is difficult to know in advance which hyperparameters are going to be the most important for your problem. Say hyperparameter 1 here matters a lot for your problem and hyperparameter 2 contributes very little: with the grid you essentially get to try out just 5 values of hyperparameter 1, and you might find almost the same results across the values of hyperparameter 2 since it does not contribute a lot. On the other hand, if you had used random sampling, you would explore the set of possible values much more richly.
So, we can use random search in the early stage of the tuning process to rapidly narrow down the search space, before using a guided algorithm to obtain finer results, going from a coarse to a fine sampling scheme. Here is a simple example with TensorFlow where I try to use Random Search on the Fashion MNIST Dataset for the learning rate and the number of units:
Hyperband (Li et al.) is another algorithm I tend to use quite often. It is essentially a slight improvement over Random Search, incorporating adaptive resource allocation and early stopping to quickly converge on a high-performing model. Here we train a large number of models for a few epochs and carry forward only the top-performing half of the models to the next round.
Early stopping is particularly useful for deep learning scenarios where a deep neural network is trained over a number of epochs. The training script can report the target metric after each epoch, and if the run is significantly underperforming previous runs after the same number of intervals, it can be abandoned.
Here is a simple example with TensorFlow where I try to use Hyperband on the Fashion MNIST Dataset for the learning rate and the number of units:
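The keras-tuner code originally embedded here isn't shown, but the carry-forward-the-top-half idea at Hyperband's core can be sketched in plain Python. This is a toy successive-halving loop, not the keras-tuner API, and `fake_train` is a made-up stand-in for "validation score after `budget` epochs":

```python
import random

def successive_halving(configs, train_one_round, rounds=3):
    # Toy version of Hyperband's core idea: score every candidate on a
    # small budget, keep the best-scoring half, and repeat with more budget.
    candidates = list(configs)
    budget = 1
    for _ in range(rounds):
        scored = [(train_one_round(c, budget), c) for c in candidates]
        scored.sort(reverse=True)  # higher score = better
        candidates = [c for _, c in scored[: max(1, len(scored) // 2)]]
        budget *= 2  # survivors get more epochs in the next round
    return candidates[0]

random.seed(0)

def fake_train(lr, budget):
    # Pretend configs near lr = 0.01 score best, with a little training noise.
    return -abs(lr - 0.01) + random.gauss(0, 1e-6 / budget)

lrs = [10 ** random.uniform(-4, -1) for _ in range(16)]
best_lr = successive_halving(lrs, fake_train)
print(best_lr)
```

With 16 candidates and 3 rounds, only 2 configurations ever receive the full budget, which is where the resource savings over plain Random Search come from.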
I am particularly interested in talking more about choosing an appropriate learning rate since for most learning applications it is the most important hyperparameter to tune consequently also deserving more attention. Having a constant learning rate is the most straightforward approach and is often set as the default schedule:
optimizer = tf.keras.optimizers.Adam(learning_rate = 0.01)
However, it turns out that with a constant LR, the network can often be trained only to a sufficient but unsatisfactory accuracy, because the initial value can prove too large, especially in the final few steps of gradient descent. The optimal learning rate depends on the topology of your loss landscape, which is in turn dependent on both the model architecture and the dataset. An optimal learning rate gives us a steep drop in the loss function. Decreasing the learning rate below the optimum still decreases the loss, but at a very shallow rate. On the other hand, increasing the learning rate beyond the optimum causes the loss to bounce around the minima. Here is a figure to sum this up:
You now understand why it is important to choose a learning rate effectively, and with that, let's talk a bit about updating your learning rate while training, that is, setting a schedule. A prevalent technique known as learning rate annealing recommends starting with a relatively high learning rate and then gradually lowering it 📉 during training. As an example, I could start with a learning rate of \(10^{-2}\); when the accuracy saturates or reaches a plateau, we could lower the learning rate to \(10^{-3}\), and maybe then to \(10^{-4}\) if required.
In training deep networks, it is usually helpful to anneal the learning rate over time. Good intuition to have in mind is that with a high learning rate, the system contains too much kinetic energy and the parameter vector bounces around chaotically, unable to settle down into deeper, but narrower parts of the loss function.
Stanford CS231n Course Notes by FeiFei Li, Ranjay Krishna, and Danfei Xu
Going back to the example above, I suspect that my learning rate should be somewhere between \(10^{-4}\) and \(10^{-2}\). If I simply updated my learning rate uniformly at random across this range, I would use about 90% of the resources on the range \(10^{-3}\) to \(10^{-2}\), which does not make sense. You could rather update the LR on the log scale; this allows us to use an equal amount of resources between \(10^{-4}\) and \(10^{-3}\) and between \(10^{-3}\) and \(10^{-2}\). Now it should be super easy for you 😎 to understand exponential decay, which is a widely used LR schedule (Li et al.). An exponential schedule provides a more drastic decay at the beginning and a gentle decay when approaching convergence.
Here is an example showing how we can perform exponential decay with TensorFlow:
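As a minimal sketch (the 0.1 initial rate and halving every 1000 steps below are illustrative values, not ones prescribed here), this is the rule that `tf.keras.optimizers.schedules.ExponentialDecay` computes with `staircase=False`:

```python
def exponential_decay(initial_lr, decay_rate, decay_steps, step):
    # Learning rate after `step` training steps; same formula as
    # tf.keras.optimizers.schedules.ExponentialDecay (staircase=False).
    return initial_lr * decay_rate ** (step / decay_steps)

# Illustrative schedule: start at 0.1 and halve the rate every 1000 steps.
schedule = [exponential_decay(0.1, 0.5, 1000, s) for s in (0, 1000, 2000)]
print([round(lr, 6) for lr in schedule])  # [0.1, 0.05, 0.025]
```

In a real training script you would pass the schedule object straight to the optimizer, e.g. `tf.keras.optimizers.SGD(learning_rate=tf.keras.optimizers.schedules.ExponentialDecay(0.1, decay_steps=1000, decay_rate=0.5))`.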
Also, note that the initial values are influential and must be carefully determined, in this case, though you might want to use a comparatively large value because it will decay during training.
To better motivate the default values I will later suggest for the momentum term, I would like to show a bit of why the momentum term is used in RMSprop and some other optimizers. The idea behind RMSprop is to accelerate gradient descent, like its precursors Adagrad (Duchi et al.) and Adadelta (Zeiler et al.), but it gives superior performance when steps become smaller. Instead of using the raw gradients dW and db directly, RMSprop scales the updates by exponentially weighted averages of their squares:
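As a rough sketch of that update rule for a single parameter (the learning rate and epsilon below are common defaults, not values from this article):

```python
import numpy as np

def rmsprop_step(w, dw, s, lr=0.001, beta=0.9, eps=1e-8):
    # Keep an exponentially weighted average of the squared gradients...
    s = beta * s + (1 - beta) * dw ** 2
    # ...and divide the update by its root, damping oscillating directions.
    w = w - lr * dw / (np.sqrt(s) + eps)
    return w, s

w, s = 1.0, 0.0
w, s = rmsprop_step(w, dw=2.0, s=s)
```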
You might have guessed by now that the moving-average term β should be a value between 0 and 1. In practice, 0.9 works well most of the time (also suggested by Geoffrey Hinton) and I would say is a really good default value. You would often consider trying a value between 0.9 (averaging across roughly the last 10 values) and 0.999 (averaging across roughly the last 1000 values). Here is a really wonderful diagram to summarize the effect of β. Here the:
red-colored line represents β = 0.9
green-colored line represents β = 0.98
As you can see, with smaller values of β the new sequence turns out to fluctuate a lot, because we're averaging over a smaller number of examples and therefore stay much closer to the noisy data. However, with bigger values of β, like 0.98, we get a much smoother curve, but it's a little bit shifted to the right because we average over a larger number of examples. So in all, 0.9 provides a good balance, but as I mentioned earlier you would often consider trying a value between 0.9 and 0.999.
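The effect in the diagram is easy to reproduce with a few lines of NumPy; the noisy signal below is synthetic, purely for illustration:

```python
import numpy as np

def ewa(xs, beta):
    # Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * x_t
    v, out = 0.0, []
    for x in xs:
        v = beta * v + (1 - beta) * x
        out.append(v)
    return out

rng = np.random.default_rng(0)
noisy = 1.0 + 0.5 * rng.standard_normal(200)   # noisy data around 1.0
smooth_09 = ewa(noisy, 0.9)    # tracks quickly but fluctuates more
smooth_098 = ewa(noisy, 0.98)  # much smoother, but lags (shifted right)
```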
Just as we talked earlier about how searching for a good learning rate does not make sense on a linear scale, and is rather done on a logarithmic scale, if you are searching for a good value of β it would again not make sense to search uniformly at random between 0.9 and 0.999. A simple trick is to instead search for 1 − β in the range 10⁻³ to 10⁻¹, on the log scale. Here's some sample code to generate these values, which we can then search across:
r = -2 * np.random.rand() # gives us random values in (-2, 0]
r = r - 1 # shift them to (-3, -1]
beta = 1 - 10**r # beta lies in [0.9, 0.999)
I will not be talking about some specific rules or suggestions about the number of hidden layers and you will soon understand why I do not do so (by example).
The number of hidden layers d is a pretty critical parameter for determining the overall structure of a neural network, and it has a direct influence on the final output. Deep learning networks with more layers can almost always capture more complex features and reach relatively higher accuracy, making depth a regular approach to achieving better results.
As an example, the ResNet model (He et al.) can be scaled up from ResNet-18 to ResNet-200 simply by using more layers, repeating the baseline structure according to the accuracy needed. Recently, Yanping Huang et al. achieved 84.3% ImageNet top-1 accuracy in their paper by scaling up a baseline model to four times its size!
Having talked about the number of layers, the number of neurons in each layer must also be carefully considered. Too few neurons in the hidden layers may cause underfitting, because the model lacks complexity. By contrast, too many neurons may result in overfitting and increased training time.
A couple of suggestions by Jeff Heaton that have worked like a charm could be a good start for tuning the number of neurons. To make it easy to understand, here I use w_i as the number of neurons in the input layer, w_o as the number in the output layer, and w_h as the number in a hidden layer.
The number of hidden neurons should be between the size of the input layer and the size of the output layer: w_o < w_h < w_i
The number of hidden neurons should be 2/3 the size of the input layer, plus the size of the output layer: w_h = (2/3) w_i + w_o
The number of hidden neurons should be less than twice the size of the input layer: w_h < 2 w_i
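The three heuristics above are easy to encode; the 784/10 layer sizes passed in below are just a hypothetical MNIST-style example, not something from this article:

```python
def heaton_suggestions(w_i, w_o):
    # w_i: neurons in the input layer, w_o: neurons in the output layer.
    return {
        "between_in_and_out": (w_o, w_i),         # w_o < w_h < w_i
        "two_thirds_rule": (2 * w_i) // 3 + w_o,  # w_h = 2/3 * w_i + w_o
        "upper_bound": 2 * w_i,                   # w_h < 2 * w_i
    }

print(heaton_suggestions(784, 10))
```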
And that is it for helping you choose the model design hyperparameters 🖌. I would also personally suggest getting some idea about tuning regularization, which has a sizable impact on the model weights; if you are interested in that, you can take a look at this blog I wrote some time back addressing it in detail:
Yeah! By including all of these concepts I hope you can start better tuning your hyperparameters and building better models and start perfecting the Art of Hyperparameter tuning🚀. I hope you liked this article.
If you liked this article, share it with everyone😄! Sharing is caring! Thank you!
Many thanks to [Alexandru Petrescu](https://www.linkedin.com/in/askingalexander/) for helping me to make this better :)
💡 #TensorFlowTip
Use .cache to save on some ops, like opening the file and reading the data, during each epoch. See the speedup with .cache in this image.
Try it out for yourself:
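A minimal sketch of the tip (the tiny range dataset and doubling map below stand in for real file reading and parsing):

```python
import tensorflow as tf

def expensive_parse(x):
    # Stand-in for costly I/O / decoding done once per element.
    return x * 2

# .cache() stores the parsed elements after the first pass, so later
# epochs skip the map work entirely.
ds = tf.data.Dataset.range(5).map(expensive_parse).cache()

for epoch in range(2):            # first epoch fills the cache
    values = list(ds.as_numpy_iterator())

print([int(v) for v in values])  # [0, 2, 4, 6, 8]
```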
💡 #TensorFlowTip
Use .prefetch to reduce the step time of training and extracting data: while the model is executing training step n, the input pipeline is reading the data for step n+1. See the speedup with .prefetch in this image.
Try it for yourself:
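And a minimal sketch for .prefetch (again the dataset is a stand-in for a real input pipeline):

```python
import tensorflow as tf

# With .prefetch, while the model trains on step n the pipeline is
# already preparing the data for step n+1; AUTOTUNE picks the buffer size.
ds = (tf.data.Dataset.range(10)
      .map(lambda x: x + 1)
      .prefetch(tf.data.AUTOTUNE))

vals = [int(v) for v in ds.as_numpy_iterator()]
print(vals)  # [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```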
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data (i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
Rather than defining groups before looking at the data, clustering allows you to find and analyze the groups that have formed organically. The Choosing K section below describes how the number of groups can be determined.
This story covers:
Why use Kmeans
The steps involved in running the algorithm
A Python example with a toy data set, from scratch
The algorithm can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets. Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.
This is a versatile algorithm that can be used for any type of grouping. Some examples of use cases are:
Behavioral segmentation:
Segment by purchase history
Segment by activities on application, website, or platform
Define personas based on interests
Create profiles based on activity monitoring
Inventory categorization:
Group inventory by sales activity
Group inventory by manufacturing metrics
Sorting sensor measurements:
Detect activity types in motion sensors
Group images
Separate audio
Identify groups in health monitoring
Detecting bots or anomalies:
Separate valid activity groups from bots
Group valid activity to clean up outlier detection
In addition, monitoring if a tracked data point switches between groups over time can be used to detect meaningful changes in the data.
K-means is actually one of the simplest unsupervised clustering algorithms. Assume we have input data points x1, x2, x3, …, xn and a value of K (the number of clusters needed). We follow the procedure below:
Pick K points as the initial centroids from the data set, either randomly or the first K.
Find the Euclidean distance of each point in the data set with the identified K points cluster centroids.
Assign each data point to the closest centroid using the distance found in the previous step.
Find the new centroid by taking the average of the points in each cluster group.
Repeat steps 2 to 4 for a fixed number of iterations or till the centroids don't change.
Here let point a = (a1, a2) and b = (b1, b2), and let D be the Euclidean distance between the points; then
D(a, b) = √((a1 − b1)² + (a2 − b2)²)
Mathematically speaking, if each cluster centroid is denoted by c_i, then each data point x is assigned to a cluster based on
arg min over c_i of D(c_i, x)
For finding the new centroid from a clustered group of points S_i, we take the mean of the points assigned to it:
c_i = (1 / |S_i|) · Σ x, summing over all x in S_i
Choosing the wrong K is really problematic, and this is what I mean.
This was a clustering problem for 3 clusters, K = 3, but if I choose K = 4 I get something like this, and it will definitely not give me the kind of prediction accuracy I want from my model. Again, if I choose K = 2, I get something like this
Now this isn't a good prediction boundary either. You or I can figure out the ideal K easily for this data, as it is 2-dimensional; we could figure out K for 3-dimensional data too. But let's say you have 900 features and reduce them to around 100: you cannot visually spot the ideal K (these problems usually occur in gene data). So you want a proper way of figuring out, or I would say guessing, an approximate K. For this we have many methods like cross-validation, the bootstrap, AUC, BIC, and many more. But we will discuss one of the simplest and easiest to comprehend, called the elbow point method.
The basic idea behind this method is to plot the cost for various values of K. As the value of K increases, there will be fewer elements in each cluster, so the average distortion decreases (fewer elements means each is closer to its centroid). The point where the rate of this decline drops sharply is the elbow point, and that gives our optimal K.
For the same data shown above, this is the cost (SSE, sum of squared errors) vs K graph created
So, our ideal K is 3. Let us take another example where ideal K is 4. This is the data
And, this is the cost vs K graph created
So, we easily infer that K is 4.
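An elbow plot like the ones above takes only a few lines; here, for brevity, I use scikit-learn's KMeans instead of the from-scratch version built next, and the synthetic three-cluster data below is purely illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

# Three well-separated synthetic clusters.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc, 0.3, size=(50, 2))
               for loc in ([0, 0], [5, 5], [0, 5])])

# SSE (inertia) for each candidate K; plot k_values vs sse and look
# for the point where the drop levels off -- the elbow.
k_values = range(1, 7)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
       for k in k_values]
```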
Here we will use the libraries Matplotlib, to create visualizations, and NumPy, to perform calculations.
So first we will just plot our data
import matplotlib.pyplot as plt
import numpy as np
from matplotlib import style
style.use('ggplot')
X = np.array([[1, 2],
[1.5, 1.8],
[5, 8 ],
[8, 8],
[1, 0.6],
[9,11]])
plt.scatter(X[:,0], X[:,1], s=150)
plt.show()
To make our work organized we will build a class and proceed as in the steps above and also define a tolerance or threshold value for the SSEs.
class K_Means:
    def __init__(self, k=2, tol=0.001, max_iter=300):
        self.k = k
        self.tol = tol
        self.max_iter = max_iter

    def fit(self, data):
        self.centroids = {}
        # Step 1: pick the first k points as the initial centroids
        for i in range(self.k):
            self.centroids[i] = data[i]
        for _ in range(self.max_iter):
            self.classifications = {}
            for i in range(self.k):
                self.classifications[i] = []
            # Steps 2-3: assign each point to its closest centroid
            for featureset in data:
                distances = [np.linalg.norm(featureset - self.centroids[centroid]) for centroid in self.centroids]
                classification = distances.index(min(distances))
                self.classifications[classification].append(featureset)
            prev_centroids = dict(self.centroids)
            # Step 4: move each centroid to the mean of its cluster
            for classification in self.classifications:
                self.centroids[classification] = np.average(self.classifications[classification], axis=0)
            optimized = True
            for c in self.centroids:
                original_centroid = prev_centroids[c]
                current_centroid = self.centroids[c]
                if np.sum((current_centroid - original_centroid) / original_centroid * 100.0) > self.tol:
                    print(np.sum((current_centroid - original_centroid) / original_centroid * 100.0))
                    optimized = False
            if optimized:
                break

    def predict(self, data):
        distances = [np.linalg.norm(data - self.centroids[centroid]) for centroid in self.centroids]
        classification = distances.index(min(distances))
        return classification
Now we can do something like:
colors = 10 * ["g", "r", "c", "b", "k"]  # one color per cluster
model = K_Means()
model.fit(X)

for centroid in model.centroids:
    plt.scatter(model.centroids[centroid][0], model.centroids[centroid][1],
                marker="o", color="k", s=150, linewidths=5)

for classification in model.classifications:
    color = colors[classification]
    for featureset in model.classifications[classification]:
        plt.scatter(featureset[0], featureset[1], marker="x", color=color, s=150, linewidths=5)

plt.show()
Next, you can also use the predict function. So that was about K-means, and we are done with the code 😀
Hi everyone I am Rishit Dagli
If you want to ask me some questions, report any mistake, suggest improvements, give feedback you are free to do so by emailing me at
You can also check out the talks I gave about this topic in the GitHub repo.
As you will notice, we focus mainly on on-device ML in this blog. Android 11 has a lot of cool updates for on-device ML, but let's talk briefly about why you should care about it; you will also understand why there is such hype about on-device ML, or ML on the edge.
The idea behind on-device ML
The idea here is to use ML such that, as opposed to the traditional approach, I no longer send my data to a server or some cloud-based system which then does the ML and returns the outputs. I instead get outputs or inferences from my ML model on the device itself; I no longer send data out of the device at all, and I do all the processing and inference on the mobile device itself.
The idea behind on-device ML
You would not directly use the model on your edge device. You would need to compress or optimize the model so you can run it on the edge device, as it has limited computation power, network availability, and disk space. In this post, however, we will not be discussing the optimization process; we will be deploying a .tflite model file. You can read more about TensorFlow Lite and the model optimization process with TensorFlow Lite.
Advantages of on-device ML
Here I have listed some advantages of using on-device ML:
The first thing that comes to mind is power consumption: you spend a lot of power sending or streaming video data continuously to a server, and sometimes it becomes infeasible to do so. (Worth a mention: sometimes the opposite can also be true, when you employ heavy on-device preprocessing.)
Another important thing to consider is the time it takes to get the output, essentially to run the model. For real-time applications this is a pretty important aspect. By not sending the data out and waiting to receive results back, I speed up my inference time too.
The traditional approach is also expensive in terms of network availability: I would need the bandwidth to continuously send data to and receive inferences from the server.
And finally, security: I no longer send data to a server or cloud-based system, in fact I no longer send data out of the device at all, thus improving security.
Note: You need Android Studio 4.1 or above to be able to use the Model Binding Plugin
What does the Model Binding Plugin focus on?
You can make a fair guess from the name as to what the ML Model Binding Plugin does: it allows us to use custom TF Lite models very easily. It lets developers import any TFLite model, read the input/output signature of the model, and use it with just a few lines of code that call the open-source TensorFlow Lite Android Support Library.
The ML Model Binding plugin makes it super easy to use a TF Lite model in your app; you essentially have a lot less code to write that calls the TensorFlow Lite Android Support Library. If you have worked with TensorFlow Lite models, you may know that you first need to convert everything to a ByteArray; you no longer have to do that with the ML Model Binding Plugin.
What I also love about this new plugin is that you can make use of GPUs and the NN API very easily; using them is now just a dependency call and a single line of code away. Isn't that cool? With Android 11, the Neural Networks API also gains unsigned integer weight support and a new Quality of Service (QoS) API, supporting even more edge scenarios. And of course, all of this makes your development a lot faster.
Using the Model Binding Plugin
Let us now see how we can implement all that we talked about.
So the first step is to import a TensorFlow Lite model with metadata. Android Studio now has a new option for importing a TensorFlow model: just right-click on the module you want to import it into and you will see an option under Others called TF Lite Model.
The import model option in Android Studio
You can now just pass in the path of your tflite model; it will be imported into a directory called ml in the module you selected earlier, from where you will be able to use the model. Adding the dependencies, and GPU acceleration too, is just a click away.
Importing a tflite model
From the model metadata I can also see the input and output shapes and a lot more that I need in order to use it; you can view this info by opening the tflite model file in Android Studio. In this screenshot I am using an open-source model made by me to classify between rock, paper, and scissors: you show your hand in front of the camera and it identifies whether it's rock, paper, or scissors, and that's what I demonstrate here too.
Viewing the model metadata
Let's finally start using the model. For streaming inference, which is most probably what you want to do (live image classification), the easiest way is to use CameraX and pass each frame to a function that performs the inference, so that function is what I'm interested in for now. You will see how easy this is; sample code also appears when you import a TF Lite model, which you can use.
private val rpsModel = RPSModel.newInstance(ctx)
We'll start by instantiating an rps model (short for rock paper scissors model) and passing it the context. With the plugin, since my model file was named RPSModel.tflite, a class of the exact same name is generated for me, so I have a class called RPSModel.
val tfImage = TensorImage.fromBitmap(toBitmap(imageProxy))
Once you do this, you need to convert your data into a form the model can use, so we'll convert the image to a TensorImage from a bitmap. If you have used the TF Interpreter, you know that you would need to convert your image to a ByteArray; you don't need to do that anymore, and you'll feed in an image proxy.
val outputs = rpsModel.process(tfImage)
.probabilityAsCategoryList.apply {
sortByDescending { it.score } // Sort with highest confidence first
}.take(MAX_RESULT_DISPLAY) // take the top results
Now we pass the data to the model: we process the image with the model and get the outputs, essentially an array of probabilities, perform a descending sort on it (as we want to show the label with the highest probability first), and then take the first n results to show.
for (output in outputs) {
items.add(
Recognition(
output.label,
output.score
)
)
}
And finally, I want to show users the labels, so I add the label corresponding to each entry in the outputs. And that's all you need 🚀
Leveraging GPU Acceleration
If you want to use GPU acceleration, it is again made very easy: you build an options object where you specify the device as GPU, and in the instantiation you pass this in as an argument. It is just as easy to use the NN API for acceleration, to do even more with Android 11.
private val options = Model.Options.Builder().setDevice(Model.Device.GPU).build()
private val rpsModel = rpsModel.newInstance(ctx, options)
You no longer need a Firebase project to work with ML Kit; it is now available even outside Firebase.
The other notable update: another way to implement a TensorFlow Lite model is via ML Kit.
As I mentioned earlier, a lot of updates in Android 11 are focused on on-device ML, due to the benefits discussed above. The new ML Kit has better usability for on-device ML: image classification and object detection and tracking (ODT) now also support custom models, which means you can bring your own tflite model file along. This also means that if you are working on a generic use case, like a specific kind of object detection, ML Kit is the best thing to use.
Using the ML Kit
Let's see this in code. As an example, I built a model which can classify different food items.
private val localModel = LocalModel.Builder()
    .setAssetFilePath("litemodel_aiy_vision_classifier_food_V1_1.tflite")
    .build()
So I will first start off by setting the model and specifying the tflite model file path for it.
private val customObjectDetectorOptions = CustomObjectDetectorOptions
.Builder(localModel)
.setDetectorMode(CustomObjectDetectorOptions.STREAM_MODE)
.setClassificationConfidenceThreshold(0.8f)
.build()
This tflite model will then run on top of the object detection model in ML Kit, so you can customize these options a bit. Here I have specifically used STREAM_MODE, as I want to work with streaming input, and I also specify the confidence threshold.
private val objectDetector = ObjectDetection.getClient(customObjectDetectorOptions)

objectDetector.process(image)
    .addOnFailureListener { Log.d(...) }
    .addOnSuccessListener {
        graphicsOverlay.clear()
        for (detectedObject in it) {
            graphicsOverlay.add(ObjectGraphic(graphicsOverlay, detectedObject))
        }
        graphicsOverlay.postInvalidate()
    }
    .addOnCompleteListener { imageProxy.close() }
So let us get to the part where we run the model; you might see some syntax similar to the previous example here. I process my image, and a thing to note is that all of these listeners (on failure, on success) are essential and need to be attached on every run. And that is all you need to do, we are done 🚀
We talked a lot about what comes after making a model; now let us take a look at how you can find models for your use cases.
TF Lite Model Maker was announced by the TensorFlow team earlier in 2020. It makes building good models super easy, gives high performance, and also allows a good amount of customization: you can simply pass in the data and use a little code to build a tflite model. You can take a look at the TensorFlow Lite Model Maker example present in the repo.
TensorFlow Hub is an open-source repository of state-of-the-art, well-documented models. The food classification model we used with ML Kit is also present on TF Hub. You also get to use models from the community. You can find these at tfhub.dev.
A few publishers on tfhub.dev
Filters in TF Hub
You can search for models in TF Hub with a number of filters: Problem Domain, if you only want image- or text-based models; Model Format, if you want to run on the web, on edge devices, or on Corals; plus filters for architecture, datasets used, and much more.
You can further directly download these models from TF Hub or also perform transfer learning on them with your own data very easily. However, for the scope of this blog, we will not be covering Transfer Learning with TF Hub, you can know more about it in this blog by me.
And many more! There are a lot of services like Teachable Machine, AutoML, and many more but these are the major ones.
All the code demonstrated here with examples about TF Lite Model Maker is present in this repo. I have also included a few trained models for you in the repo to get started and experiment with for starters. Rishitdagli/MLwithAndroid11 A repository demonstrating all thats new in Android 11 for ML and how you could try it out for your own usecasesgithub.com
Hi everyone I am Rishit Dagli. Made something with the content in the repo or this blog post, tag me on Twitter, and share it with me!
If you want to ask me some questions, report any mistake, suggest improvements, give feedback you are free to do so by emailing me at
All the code used here is available in my GitHub repository here.
This is the fifth part of the series where I post about TensorFlow for Deep Learning and Machine Learning. In the earlier blog post you saw how to apply a convolutional neural network for computer vision with some real-life datasets, and it did the job pretty nicely. This time you're going to work with more complex data and do even more with it. I believe in hands-on coding, so we will have many exercises and demos which you can try yourself too. I would recommend you play around with these exercises, change the hyperparameters, and experiment with the code. If you have not read the previous article, consider reading it first; this one is more like a continuation of it.
In the previous blog post we worked with MNIST data, which was pretty simple: grayscale 28 x 28 images, with the thing you want to classify centered in the image. Real-life data is different: it has more complex images, and your subject might be anywhere in the image, not necessarily centered. Our dataset had very uniform images too. This time we'll also work on a larger dataset.
We'll be using the Cats vs Dogs dataset to try out these things for ourselves. TensorFlow has something called [ImageDataGenerator](https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator) which simplifies things for us, allowing us to read the images directly from their location. You would first have two directories, train and validation; each of them would have two subdirectories, Cats and Dogs, each holding the respective images, which are auto-labeled for us. Here's how the directory structure looks
The directory structure
Let's now see this in code. ImageDataGenerator is present in tensorflow.keras.preprocessing.image, so first let's go ahead and import it
from tensorflow.keras.preprocessing.image import ImageDataGenerator
Once you do this you can now use the ImageDataGenerator:
train_image_generator = ImageDataGenerator(rescale=1./255)
train_data_gen = train_image_generator.flow_from_directory(
    batch_size=batch_size,
    directory=train_dir,
    shuffle=True,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    class_mode='binary')
We first pass in rescale=1./255 to normalize the images; you can then call the flow_from_directory method on that directory and its subdirectories. In this case, taking the above diagram as a reference, you would pass in the Training directory.
Images in your data might be of different sizes; target_size converts or resizes them all to one size. This is a very important step, as all inputs to the neural network should be of the same size. A nice thing about this code is that the images are resized for you as they're loaded, so you don't need to preprocess thousands of images on your file system; it is instead done at runtime.
The images will be loaded for training and validation in batches where its more efficient than doing it one by one. You can specify this by the batch_size
, there are a lot of factors to consider when specifying a batch size which we will not be discussing in this blog post. But you can experiment with different sizes to see the impact on the performance.
This is a binary classifier, that is, it picks between two different things (cats and dogs), so we specify that here with class_mode.
And that's all you need to read your data, auto-label it according to its directories, and do some processing at run time. So let's do the same for the validation data too
validation_image_generator = ImageDataGenerator(rescale=1./255)
val_data_gen = validation_image_generator.flow_from_directory(
    batch_size=batch_size,
    directory=validation_dir,
    shuffle=True,
    target_size=(IMG_HEIGHT, IMG_WIDTH),
    class_mode='binary')
Another great thing about ImageDataGenerator is that there is little or almost no change when building and training the model, so let's build a sample model for the dogs vs cats problem and then compile it.
model = Sequential([
Conv2D(16, (3,3), padding='same', activation='relu',
input_shape=(IMG_HEIGHT, IMG_WIDTH ,3)),
MaxPooling2D(2,2),
Conv2D(32, (3,3), padding='same', activation='relu'),
MaxPooling2D(2,2),
Conv2D(64, (3,3), padding='same', activation='relu'),
MaxPooling2D(2,2),
Flatten(),
Dense(512, activation='relu'),
Dense(1, activation='sigmoid')])
model.compile(optimizer='adam',
loss='binary_crossentropy',
metrics=['accuracy'])
You can see that you don't have to make any changes when compiling your model to make it work with ImageDataGenerator; let's get to the training part now.
history = model.fit_generator(
train_data_gen,
steps_per_epoch=total_train//batch_size,
epochs=epochs,
validation_data=val_data_gen,
validation_steps=total_val//batch_size
)
A difference you will see here is that instead of passing the training data directly after loading it, I now pass train_data_gen, which reads the data from disk using ImageDataGenerator and performs the transformations on it. You can do the same with the validation data too.
All the code we just talked about is implemented in this notebook. The model we built is not yet a perfect or suitable model and suffers from overfitting; we will see how to tackle this problem in the next blog in this series.
You can use the Open in Colab button to directly open the notebook in Colab, or even download it and run it on your system.
Hi everyone I am Rishit Dagli
If you want to ask me some questions, report any mistake, suggest improvements, give feedback you are free to do so by emailing me at
Model creation is definitely an important part of AI applications, but it is also very important to know what comes after training. I will show how you can serve TensorFlow models over HTTP and HTTPS, and do things like model versioning and model server maintenance easily with TF Model Server. You will also see the steps required and the process you should follow. We will also take a look at Kubernetes and GKE to autoscale your deployments.
All the code used for the demos in this blog post and some additional examples are available at this GitHub repo Rishitdagli/GDGAhmedabad2020 My session at Google Developers Group Ahmedabad about Deploying models to production with TensorFlow model server, 30github.com
I also delivered a talk about this at GDG (Google Developers Group) Ahmedabad, find the recorded version here
The slide deck for the same is available here bit.ly/tfserverdeck
I expect that you have worked with some Deep Learning models and made a few models yourself, it could be with Keras too.
Source: me.me
This is a very accurate representation of the ideal scenario: your model works well in the testing or development set, and when you bring it out into the real world, or production, that is when problems start appearing and things start failing. You further need to ensure your model is up and running all the time and that your users have an easy way to interact with it. It should also work well on all platforms, be it web, Android, iOS, or embedded systems. You will need a resilient pipeline to serve your model and an efficient way to pass data. These are just a few differences between production and development.
So first you would want to get the model and package it in a way that you can use it in a production environment. You ideally would not put a Jupyter notebook in the same format into production, you can if you wanted to but it is generally not advised to do so.
Then you would want to host your model on some cloudhosted server so you can serve your users, provide them the model whenever they need the model.
Now that you have put your model on a cloud-hosted server, you also need to maintain it. One of the main things I mean by maintaining the server is autoscaling: you should be able to handle an increase in the number of users, maybe by increasing your resources.
A Diagram showing process of autoscaling
Another thing you now want to ensure is that you have a global availability for your models. This also means that your users in a particular region should not face a high amount of latency.
After this you also need to provide a way for your model to interact with your devices; most probably you would provide an API to work with. This API can then be called by you, or someone else could call your model and get the predictions back. And now you have another thing to maintain and keep running: an API.
So tomorrow, when you or your team make an update to the model, you need to ensure your users get the latest version; we will talk about this in great detail shortly. As an example, consider that you provide some image classification service: you would have to update your models constantly so they can predict newly available images with good accuracy.
Now that you know what it takes to actually deploy your models to a cloudhosted server or serving your models let us see what TF Model server is and more importantly how it can help you in your use case.
TF Serving is a part of TFX, or TensorFlow Extended. Simply put, TF Extended is a platform designed to help you build production-ready Machine Learning systems. We will only be talking about a specific subpart of TFX called TF Serving.
Credits: Laurence Moroney
So this is all that you would have to do to create a fullfledged ML solution. What I want you to take a look at is that modeling is not the only part and in fact, is not the major process in creating a deployment. TF Serving helps you do the serving infrastructure part easily.
So maybe you want to run your models on multiple devices: on mobile devices, on low-power embedded systems, or on the web. Maybe you could use TF Lite to deploy to mobile devices and embedded systems, or create a JavaScript representation of your model and run it directly on the web. But many times it is a better idea to have a centralized model: your devices send a request, the server executes it and sends the predictions back to the device that made the call.
A diagram to show how TF Serving works
Remember we talked about model versioning: in an architecture like this you would just need to deploy the new version of the model at the server, and all your devices would instantly have access to it. However, if you used the traditional approach, like sending out app updates, some users would have the new model and some the old model, creating a bad experience. In a cloud-based environment you could also use dynamic assignment of resources according to the number of users you have.
TF Serving allows you to do this quite easily so you don't run into issues like "Oh, I just updated my model and my server is down!" and things like that. Let's see this in practice now.
So of course before starting you need to install TF Serving. The code examples for this blog in the repo have the installation commands typed out for you. Installing it is pretty straightforward, find installation steps here.
So, you already have a model, and the first thing you would do is simply save it in a format usable by TF Serving. The directory_path argument below tells it where to save the model, and the inputs and outputs arguments pass in the model's inputs and outputs.
import tensorflow as tf
from tensorflow import keras

tf.saved_model.simple_save(
    keras.backend.get_session(),  # the TF 1.x session holding the trained weights
    directory_path,               # where the SavedModel will be written
    inputs={'input_image': model.input},
    outputs={i.name: i for i in model.outputs}
)
If you navigate to the path where you saved this model, you will see a directory structure like this. Also, I made a directory called 1, which is my model version; we will see how TF Serving helps us manage and deploy these versions. And note that your model graph is saved with a .pb extension.
Saved model directory
There is another great interface called the saved model CLI which I find pretty useful. This gives you a lot of useful information about your saved model like operation signatures and inputoutput shapes.
!saved_model_cli show --dir [DIR] --all
Here is sample output showing the information this tool provides
Saved Model CLI Output
So here is how you would then start the model server; let us break it down.
os.environ["MODEL_DIR"] = MODEL_DIR
%%bash --bg
nohup tensorflow_model_server \
  --rest_api_port=8501 \
  --model_name=test \
  --model_base_path="${MODEL_DIR}" >server.log 2>&1
The nohup tensorflow_model_server line tells it to use the TensorFlow Model Server. Of course, you would not include the bash cell magic (the %%bash --bg line) while implementing this outside a notebook, but as I assume most of you might use Colab, I have added it since Colab doesn't provide a direct terminal.
The rest_api_port flag specifies the port on which you want to run the TF Model Server and is pretty straightforward too.
A thing to note is that the model_name will also appear in the URL on which you will be serving your models, so if you have multiple models in action, managing your serving URLs also becomes a lot easier.
The redirect to server.log at the end writes the server's output to a log file, and logs are just so helpful while debugging; I have personally used them quite a lot to figure out errors easily.
Let us now get to the most interesting part which is performing inference over the model.
A thing to keep in mind while performing inference over the model is that while passing the data in, your data should be lists of lists and not just lists. This is in fact an added advantage for developers. Let us see what this means
xs = np.array([[case_1], [case_2] ... [case_n]])
Here each of case_1, case_2, ..., case_n has all the values for your features x1, x2, ..., xi.
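To make the shape requirement concrete, here is a small NumPy sketch; the case values and feature count are made up purely for illustration:

```python
import numpy as np

# Hypothetical feature values for three cases (the numbers are made up).
case_1 = [5.1, 3.5, 1.4, 0.2]
case_2 = [4.9, 3.0, 1.4, 0.2]
case_3 = [6.2, 3.4, 5.4, 2.3]

# A list of lists: one inner list per case rather than one flat list of values.
xs = np.array([case_1, case_2, case_3])
print(xs.shape)  # (3, 4): number of cases by number of features
```

Each row is one instance to run inference on, which is exactly what the instances field of the request will carry.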
data = json.dumps({"signature_name": [SIGNATURE],
"instances": xs.tolist()})
If you know a bit about saved models, you might know about SignatureDef; for folks who don't, it defines the signature of a computation supported in a TensorFlow graph, so you can have support for I/O for a function. You can easily find it with the saved_model_cli. In the instances part we put in the xs we just created.
Now you can simply make an inference request
json_response = requests.post(
'http://localhost:8501/v1/models/test:predict',
data = data,
headers = headers)
Remember us talking about versions? Note the model URL: the v1 it contains is the version of the serving REST API, and if you want to pin a particular model version you can address it as /v1/models/test/versions/1:predict. You can also see the model name test reflected in the URL.
Your headers would simply be something like this as you are passing data in as JSON
headers = {"content-type": "application/json"}
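Once the request returns, the predictions come back as JSON. As a sketch of what parsing that looks like, here is a hypothetical response body standing in for json_response.text; the scores are made up:

```python
import json

# A hypothetical response body; TF Serving's REST API returns a JSON object
# with one entry in "predictions" per instance that was sent.
response_text = '{"predictions": [[0.1, 0.7, 0.2], [0.8, 0.1, 0.1]]}'

predictions = json.loads(response_text)["predictions"]
print(len(predictions))   # one prediction per instance sent
print(predictions[0])     # class scores for the first instance
```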
A lot of you might feel it would be interesting to see how you would pass in images. It is in fact very easy to do so: in place of the case_1 we saw above, you would just put the list of pixel values that make up your image.
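As a small sketch of that idea, here is a hypothetical 2x2 "image" turned into a payload; a real image would just be a bigger nested list of pixel values:

```python
import json
import numpy as np

# A hypothetical 2x2 grayscale "image"; a real image is just a bigger array.
image = np.array([[0.0, 0.5],
                  [0.5, 1.0]])

# tolist() produces the nested Python lists that json.dumps accepts; the image
# takes the place of case_1 in the instances list.
data = json.dumps({"instances": [image.tolist()]})
payload = json.loads(data)
print(len(payload["instances"]))  # 1 instance, holding the whole image
```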
Now you know how TF Serving works and how you could use it. Having seen this, here are a few more things it provides. You can now understand that if, let us say, you deploy version 2 of your model, in case of any problems your version 1 will still be active and usable.
Also, while your new model version is being deployed you have no downtime; your older version will continue to work just as normal, and these things are often quite helpful. TF Model Server helps you do this very easily.
A really wonderful part about TF Model Server is that it lets you focus on writing real code and not worry about managing infrastructure, and that is something really useful; you do not want to spend your time on infrastructure as a developer. This, in turn, allows you to build better ML applications and get them up and running a lot faster.
The things we saw could be scaled well to the cloud, making them even more powerful; you could always have your own server, but we will not be discussing Cloud versus on-premise here, or how load balancing takes place. We will see a brief workflow for deploying the model on the cloud with Kubernetes; with TF Model Server this becomes a lot easier.
I will start by assuming you have trained a model and built a Docker image for it. The steps to do this are pretty straightforward, and I have also listed them in the GitHub repo for this session. We will start by creating a Kubernetes cluster with 5 nodes, and I will show you how you can deploy a simple resnet model on the cloud, hence the name.
gcloud container clusters create resnet-serving-cluster \
  --num-nodes 5

docker tag $USER/resnet_serving gcr.io/[PROJECT_ID]/resnet

docker push gcr.io/[PROJECT_ID]/resnet
You can then push your docker image to Container Registry
And with that you are all ready to create a deployment, the YAML file shown here would create the metadata for deployments like your image and the number of replicas you want, I have included a sample for you in the repo.
kubectl create -f [yaml]
Another thing to keep in mind: while performing inference with a model hosted on the cloud, you would now use the external IP to make inference requests instead of the localhost we used earlier.
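One way to keep this switch painless is a tiny helper that builds the REST URL; this is my own hypothetical convenience function, not part of TF Serving. Note that the v1 in the path is the REST API version, while a specific model version goes in a /versions/N segment:

```python
# A hypothetical convenience function (not part of TF Serving) that builds the
# REST endpoint, so moving from localhost to a cluster's external IP is a
# one-argument change.
def serving_url(host, model_name, version=None, port=8501):
    base = "http://{}:{}/v1/models/{}".format(host, port, model_name)
    if version is not None:
        # Pin a specific model version via the /versions/N path segment.
        base += "/versions/{}".format(version)
    return base + ":predict"

print(serving_url("localhost", "test"))
print(serving_url("10.0.0.1", "resnet", version=2))  # hypothetical external IP
```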
We started out with why it is worth having a process for deployment. We then saw what is required to deploy your model: versioning, availability, a global nature, infrastructure, and a lot more. We saw what TF Model Server provides and why you should choose it to make deployments, and then the process involved with TF Model Server. You also saw how it allows you to write real code and not worry about infrastructure, the version-wise URLs, and easy management of them. We then moved on to see how this could be replicated on the cloud and how Kubernetes could make this easy for us.
I have included a few notebooks for you which implement all that you saw in this post, try them out for yourself.
Hi everyone I am Rishit Dagli
If you want to ask me some questions, report any mistake, suggest improvements, give feedback you are free to do so by emailing me at
One of the best things about AI is that there is a lot of open-source content, and we use it quite frequently. I will show how TensorFlow Hub makes this process a lot easier and allows you to seamlessly use pretrained convolutions or word embeddings in your application. We will then see how we could perform Transfer Learning with TF Hub models, and also how this can expand to other use cases.
All the code used for the demos in this blog post is available at the GitHub repo Rishitdagli/GDGNashik2020.
I also delivered a talk about this at GDG (Google Developers Group) Nashik, find the recorded version here
Source: me.me
We all use open-source code: we try to make it better, do some tweaks, and maybe make it work for us. I feel this is the power of community; you can share your work with people all around the world.
So that is the major thing TF Hub focuses on: making open-source code, models, and datasets super simple to use.
As a result of using prebuilt content, you can develop applications faster as now you have access to work that has already been done by other people and leverage that.
Not everybody has access to a high-power GPU or a TPU to train models for days; you can get started by building on open-sourced content using transfer learning, and we will see how you can leverage that. Say you want to build a model that takes all the reviews you receive on your product and sees whether they were good reviews or bad reviews. You could use word embeddings from a model trained on the Wikipedia dataset and do very well in just some time. This is a very simple example, and we will see more of it later.
Modeling is an important part of any Machine Learning application, and you would want to invest quite some time and effort to get it right. Note that I am not saying modeling is the only important thing; in reality there is a lot more, like building pipelines and serving models, but we will not explore those in this blog. TF Hub helps you do the modeling part better, faster, and more easily.
tfhub.dev
You can see a friendly GUI version of TensorFlow Hub at tfhub.dev. You can filter models based on the type of problem you want to work on, the model format you need, and also the publisher. This definitely makes the model discovery process a lot easier for you.
You can find a lot of well tested and state of the art models right there on TF Hub. Each of the models there on TF Hub is well documented and thoroughly tested, so you as a developer can make the best use of it. A lot of models on TF Hub even have sample Colab Notebooks to show how they work.
What is even better is that you can easily integrate it with your model APIs, so you want to have flexibility while building models, and TF Hub helps you to do so. You will then see for yourself too how well TF Hub is integrated with Keras or Core TensorFlow APIs which really makes it super easy.
This may not be the first reason you would consider TF Hub, but it is good to know the wide array of publishers on TF Hub; some of the major ones are listed in the image below.
A few publishers on TF Hub
So a lot of times your codebases become very dependent or coupled, which can make the experimentation and iteration process a lot slower. Hub defines artifacts for you which are not dependent on code, so you have a system that allows for faster iteration and experimentation too.
Credits: Sandeep Gupta
Another great thing about TF Hub is that it does not matter if you use high-level Keras APIs or low-level APIs. You can also use it in your production-grade pipelines or systems; it integrates quite well with TensorFlow Extended too. You can also use TF.js models for web-based or Node environments. Edge is taking off, and TF Hub has you covered in that aspect too: you can use TF Lite pretrained models to run your models directly on your mobile device and on low-power embedded systems. You can also discover models for Coral Edge TPUs, which is essentially TF Lite combined with a powerful edge TPU.
Let us start by talking about transfer learning, the thing which gives TF Hub its power. As you might know, making any model from scratch requires good algorithm and architecture selection, a lot of data, compute, and of course domain expertise. That is quite a lot.
The idea behind using TF Hub
So what happens with TF Hub is that someone takes all that is required to make a model from scratch, makes what is called a module, and takes out some part of the model which is reusable. In the case of text, that could be the word embeddings, which give the model a lot of power and often require quite some compute to train. In the case of images, it could be the learned features. So you have this reusable part from a module in the TF Hub repo. With transfer learning, this reusable piece can be used in your different models, and it might even serve a different purpose; this is essentially how you would be using TF Hub. When I say a different purpose, it could be, say, a classifier trained on a thousand labels that you use just to distinguish between 3 classes. That's just one example, but you probably get the idea now.
Why use Transfer Learning?
Transfer Learning allows you to achieve generalization on your data, this is particularly quite helpful when you want to run your model on reallife data.
As you have your embeddings or weights or convolutions learned prehand you can train highquality models with very little data. This is very helpful when you can just not get any more data or getting more data is too costly.
We already discussed earlier that Transfer Learning would take less time to train.
We will not be focusing on installing TF Hub here; it is pretty straightforward, find the installation steps here.
You can easily load your model with this piece of code
MODULE_HANDLE = 'https://tfhub.dev/google/imagenet/inception_v3/classification/4'
module = hub.load(MODULE_HANDLE)
You can, of course, change the MODULE_HANDLE URL for the model you need.
So, when you call the hub.load method, TF Hub downloads the model for you, and you will notice that it is a SavedModel directory: you have your model as a protobuf file which holds the graph definition.
There is another great interface called the saved model CLI which I find pretty useful. This gives you a lot of useful information about your saved model like operation signatures and inputoutput shapes.
!saved_model_cli show --dir [DIR] --all
Here is sample output showing the information this tool provides
Saved Model CLI Output
You can now directly perform inference with your loaded model.
tf.nn.softmax(module([img]))[0]
But this is a bad approach as you are not generalizing your model.
A lot of times it is better to use image feature vectors: these modules remove the final classification layer from the network and allow you to train a network on top, allowing for greater generalization and in turn better performance on real-life data.
You could do it simply in this manner
model = tf.keras.Sequential([
hub.KerasLayer(MODULE_HANDLE,
input_shape=IMG_SIZE + (3,),
output_shape=[FV_size],
trainable=True),
tf.keras.layers.Dense(NUM_CLASSES,
activation='softmax')])
So you can see I now use hub.KerasLayer to add the module to my model as a Keras layer, and I also set trainable to True as I want to perform transfer learning.
We then add a Dense layer after it, so you are taking the model, adding your own layers, and retraining it with the data you have; of course, you could have multiple layers or even convolutional layers on top of this. Then you can fit it like any other Keras model after compiling it. The fitting and compiling process remains the same, which shows the level of integration of TF Hub.
One of the major problems we need to address when converting text to numbers is doing it in a way that preserves the meaning or semantics of the text. We often use what are called word embeddings to do so. Word vectors are, as the name suggests, vectors, and semantically similar words have vectors pointing in similar directions. The size of the vector is also called the embedding dimension. Most of the training is about learning these embeddings.
I would load embedding something like this
embedding = "URL"
hub_layer = hub.KerasLayer(embedding,
input_shape=[],
dtype=tf.string,
trainable=True)
The loading part is where things get a bit different: you do something like this to specify that you are loading embeddings, and then you are ready to use them in your neural net in the same manner we did earlier. So there is not much difference in practice, but it is good to know how you could handle word embeddings.
Another thing to consider
There is in fact another added feature here. Let's say I am using an embedding layer that returns 20-dimensional embedding vectors. This means if you pass in a single word, it returns you a 1-by-20 array.
Singleword embeddings
Let's say I pass two sentences. I want you to look at the output shape: it is 2 by 20. The 2 is of course because I passed a list with two items, but why a 20? A 20-dimensional vector should be there for each word, so why do I have a single 20-dimensional vector for a complete sentence? The layer intelligently converts these word embeddings to sentence embeddings for you, so you no longer have to worry about keeping the shape constant or things like that.
Embeddings for two sentences
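To build intuition for what such pooling does, here is a toy sketch with random 20-dimensional word vectors; mean pooling is my illustrative assumption here, and a real TF Hub module learns its vectors from data and may pool differently:

```python
import numpy as np

# Toy 20-dimensional word vectors (random, purely for illustration; a real
# TF Hub module learns these from data and may pool sentences differently).
rng = np.random.default_rng(0)
vocab = {w: rng.standard_normal(20)
         for w in ["the", "movie", "was", "great", "bad"]}

def sentence_embedding(sentence):
    # One simple pooling scheme: average the word vectors of the sentence.
    vectors = [vocab[w] for w in sentence.lower().split()]
    return np.mean(vectors, axis=0)

batch = [sentence_embedding(s)
         for s in ["the movie was great", "the movie was bad"]]
print(np.array(batch).shape)  # (2, 20): one 20-dim vector per sentence
```

Whatever the sentence length, the pooled output stays 20-dimensional, which is exactly why the layer's output shape is 2 by 20 for two sentences.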
I strongly recommend you try out these examples for yourself; they can be run in Colab itself. Find them at the GitHub repo for this post: Rishitdagli/GDGNashik2020.
A neural style transfer algorithm would ordinarily require quite some compute and time; we can use TensorFlow Hub to do this easily. We will start by defining some helper functions to convert images to tensors and vice versa.
def image_to_tensor(path_to_img):
img = tf.io.read_file(path_to_img)
img = tf.image.decode_image(img, channels=3, dtype=tf.float32)
# Resize the image to specific dimensions
img = tf.image.resize(img, [720, 512])
img = img[tf.newaxis, :]
return img
def tensor_to_image(tensor):
tensor = tensor*255
tensor = np.array(tensor, dtype=np.uint8)
tensor = tensor[0]
plt.figure(figsize=(20,10))
plt.axis('off')
return plt.imshow(tensor)
You can then convert the images to tensors, create an arbitrary image stylization model, and start building images right away!
hub_module = hub.load('https://tfhub.dev/google/magenta/arbitrary-image-stylization-v1-256/2')
combined_result = hub_module(tf.constant(content_image_tensor), tf.constant(style_image_tensor))[0]
tensor_to_image(combined_result)
With this, you can now create an image like this (images in the GitHub repo).
The output of Neural Style Transfer
We will now see how we could make a model that uses transfer learning. We will try our hands on the IMDB dataset to predict whether a comment is positive or negative. We can use the gnews-swivel embeddings, which were trained on 130 GB of Google News data.
So, we can have our model loaded simply like this
embedding = "https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1"
hub_layer = hub.KerasLayer(embedding, input_shape=[], dtype=tf.string, trainable=True)
We can then build a model on top of that
model = tf.keras.Sequential([
hub_layer,
tf.keras.layers.Dense(16, activation='relu'),
tf.keras.layers.Dense(1, activation='sigmoid')])
And then you could train your model like any other Keras model
All the LaTeX code used in this blog is compiled in the GitHub repo for this blog: Rishitdagli/SolvingoverfittinginNeuralNetswithRegularization.
The other way to address high variance is to get more training data, which is quite reliable. If you get more training data, you can think of it as trying to generalize your weights for all situations, and that solves the problem most of the time, so why anything else? A huge downside is that you cannot always get more training data: it could be expensive, and sometimes it just is not accessible.
It now makes sense for us to discuss some methods which would help us reduce overfitting. Adding regularization will often help to prevent overfitting. Guess what, there is a hidden benefit with this, often regularization also helps you minimize random errors in your network. Having discussed why the idea of regularization makes sense, let us now understand it.
We will start by developing these ideas for logistic regression. Just recall that you are trying to minimize a cost function J, which looks like this:

J(w, b) = (1/m) Σ_i L(ŷ^(i), y^(i))
And your w is an n_x-dimensional parameter vector and L is the loss function. Just a quick refresher, nothing new.
So, to add regularization we will add a term to this cost function equation; we will see more of this:

J(w, b) = (1/m) Σ_i L(ŷ^(i), y^(i)) + (λ/2m) ‖w‖₂²

So λ is another hyperparameter that you might have to tune, called the regularization parameter.
According to convention, ‖w‖₂² just means the Euclidean (L2) norm of w, squared. Let us summarize this in an equation so it becomes easy for us; we will just expand the norm term a bit:

‖w‖₂² = Σ_{j=1}^{n_x} w_j² = wᵀw

I have just expressed it in terms of the squared Euclidean norm of the vector w. The term we just added is called L2 regularization. Don't worry, we will discuss more about how this term works in a while, but at least you now have a rough idea. There is actually a reason this method is called L2 regularization: it is simply because we are computing the L2 norm of w.
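As a quick numerical check of the definition, here is the squared L2 norm computed three equivalent ways in NumPy; the weight values are arbitrary:

```python
import numpy as np

# The squared L2 norm of the parameter vector w, computed three equivalent ways.
w = np.array([0.5, -1.0, 2.0])

norm_sq_sum = np.sum(w ** 2)           # sum of squared components
norm_sq_dot = float(w @ w)             # w transpose w
norm_sq_np = np.linalg.norm(w) ** 2    # NumPy's Euclidean norm, squared

print(norm_sq_sum, norm_sq_dot, norm_sq_np)  # all 5.25
```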
Till now, we discussed regularization of the parameter w, and you might have asked yourself: why only w? Why not add a term with b? That is a logical question. It turns out that in practice you could add a term for b as well, but we usually just omit it. If you look at the parameters, you will notice that w is usually a pretty high-dimensional vector and carries most of the variance problem: w has a lot of individual parameters, so you aren't fitting all of them well, whereas b is just a single number. Almost all of your parameters are in w rather than in b, so even if you add that last b term to your equation, in practice it would not make a great difference.
We just discussed L2 regularization, and you might also have heard of L1 regularization. L1 regularization is when, instead of the term we were talking about, you add the L1 norm of the parameter vector w. Let us see this term in a mathematical way:

(λ/2m) Σ_{j=1}^{n_x} |w_j|

If you use L1 regularization, your w might end up being sparse, which means the w vector will have a lot of zeros in it. It is often said that this can help with compressing the model: since a set of parameters is zero, you need less memory to store the model. I feel that, in practice, making your model sparse with L1 regularization helps only a little bit, so I would not recommend it, at least not for model compression. When you train your networks, L2 regularization is just used much, much more often.
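To see why L1 tends to produce exact zeros while L2 only shrinks, here is a toy comparison; the soft-thresholding step shown is the proximal update for the L1 penalty, used purely for illustration, and the weight values and lambda are made up:

```python
import numpy as np

w = np.array([0.03, -0.8, 0.005, 1.5, -0.02])
lam = 0.05

# L1's proximal update (soft-thresholding) snaps small weights to exactly zero...
w_l1 = np.sign(w) * np.maximum(np.abs(w) - lam, 0.0)

# ...while L2-style shrinkage only scales every weight toward zero.
w_l2 = w * (1 - lam)

print(np.sum(w_l1 == 0))  # 3 weights zeroed out -> sparse
print(np.sum(w_l2 == 0))  # 0 weights exactly zero
```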
We just saw how to regularize logistic regression, and you now have a clear idea of what regularization means and how it works. So now it would be good to see how these ideas extend to neural nets. Recall our cost function for a neural net; it looks something like this:

J(w^[1], b^[1], ..., w^[L], b^[L]) = (1/m) Σ_i L(ŷ^(i), y^(i))

Now recall what we added while discussing this earlier: the regularization parameter λ, the 1/2m scaling, and most importantly the norm. We will do something similar, except we sum over the layers. So, the term we add looks like this:

(λ/2m) Σ_{l=1}^{L} ‖w^[l]‖_F²
Let us now simplify this norm term; it is defined as the sum over i and over j of each element of the matrix, squared:

‖w^[l]‖_F² = Σ_{i=1}^{n^[l]} Σ_{j=1}^{n^[l-1]} (w_ij^[l])²

What the limits of the sums tell you is that your weight matrix w^[l] has dimensions (n^[l], n^[l-1]), the number of units in layers l and l-1 respectively. This is also called the Frobenius norm of a matrix; the Frobenius norm is highly useful and appears in quite a lot of applications, the most exciting of which is recommendation systems. Conventionally it is denoted by a subscript F. You might say it would be easier to just call it the L2 norm of the matrix, but for conventional reasons we call it the Frobenius norm and give it its own notation:

‖·‖₂ : L2 norm (of a vector)
‖·‖_F : Frobenius norm (of a matrix)
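A quick NumPy check that the entry-wise definition matches the built-in Frobenius norm; the matrix values are arbitrary:

```python
import numpy as np

# The Frobenius norm of a weight matrix: square every entry, sum, square root.
W = np.array([[1.0, 2.0],
              [3.0, 4.0]])

frob_manual = np.sqrt(np.sum(W ** 2))       # sqrt(1 + 4 + 9 + 16) = sqrt(30)
frob_numpy = np.linalg.norm(W, ord="fro")   # NumPy's built-in Frobenius norm

print(frob_manual, frob_numpy)  # both ~5.477
```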
So, earlier what we would do is compute dw^[l] using backpropagation, getting the partial derivative of the cost function J with respect to w^[l] for any given layer l, and then update w^[l] := w^[l] - α·dw^[l], with α the learning rate. Now that we have got our regularization term into the objective, we simply add a term for the regularization. These are the steps and equations:

1. dw^[l] = (term from backpropagation) + (λ/m)·w^[l]
2. ∂J/∂w^[l] = dw^[l]
3. w^[l] := w^[l] - α·dw^[l]
So, the first step used to be just what we received from backpropagation, and we have now added a regularization term to it. The other two steps are pretty much the same as you would do in any other neural net. This new dw^[l] is still a correct definition of the derivative of your cost function with respect to your parameters, now that you have added the extra regularization term at the end. It is for this reason that L2 regularization is sometimes also called weight decay. Now take the equation from step 1 and substitute it into the step 3 equation:

w^[l] := w^[l] - α[(term from backpropagation) + (λ/m)·w^[l]] = (1 - αλ/m)·w^[l] - α·(term from backpropagation)
So, what this shows is that whatever your matrix w^[l] is, you are going to make it a bit smaller: it is as if you are taking the matrix w^[l] and multiplying it by (1 - αλ/m).
This is why L2 regularization is also called weight decay: it is just like ordinary gradient descent, where you update w by subtracting α times the original gradient you got from backpropagation, but now you are also multiplying w by this factor, which is a little bit less than 1. I do not use the term weight decay often, but now you know where the intuition for its name comes from.
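We can verify this weight-decay identity numerically; the weight and gradient values here are arbitrary stand-ins for what backpropagation would return:

```python
import numpy as np

# Check numerically that the L2-regularized update equals plain gradient
# descent applied after first shrinking w by the factor (1 - alpha*lam/m).
rng = np.random.default_rng(1)
w = rng.standard_normal((3, 3))
grad = rng.standard_normal((3, 3))   # gradient from backprop (toy values)
alpha, lam, m = 0.1, 0.7, 50         # learning rate, lambda, batch size

# Steps 1 and 3 combined: w := w - alpha * (grad + (lam/m) * w)
update_a = w - alpha * (grad + (lam / m) * w)

# Equivalent "weight decay" form: shrink w, then take the usual step.
update_b = (1 - alpha * lam / m) * w - alpha * grad

print(np.allclose(update_a, update_b))  # True
```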
When implementing regularization, we added a Frobenius-norm term which penalizes the weight matrices for being too large. So, the question to think about is: why does shrinking the Frobenius norm reduce overfitting?
Idea 1
An idea is that if you crank regularisation parameter to be really big, theyll be really incentivized to set the weight matrices w
to be reasonably close to zero. So one piece of intuition is maybe it set the weight to be so close to zero for a lot of hidden units thats basically zeroing out a lot of the impact of these hidden units. And if that is the case, then the neural network becomes a much smaller and simplified neural network. In fact, it is almost like a logistic regression unit, but stacked most probably as deep. And so that would take you from the overfitting case much closer to the high bias case. But hopefully, there should be an intermediate value of that results in an optimal solution. So, to sum up you are just zeroing or reducing out the impact of some hidden layers and essentially a simpler network.
The intuition of completely zeroing out a bunch of hidden units isn't quite right and does not work too well in practice. It turns out that what actually happens is that we still use all the hidden units, but each of them just has a much smaller effect. You do end up with a simpler network, though, much as if you had a smaller network, and that is therefore less prone to overfitting.
Idea 2
Here is another intuition for regularization and why it reduces overfitting. To understand this idea, we take the example of the tanh activation function, so our g(z) = tanh(z).
Here, notice that if z takes on only a small range of values, that is, z is close to zero, then you are just using the linear regime of the tanh function. Only if z is allowed to wander up to larger or smaller values, farther from 0, does the activation function start to become less linear. So the intuition you might take away is that if λ, the regularization parameter, is large, then your parameters will be relatively small, because they are penalized for being large in the cost function.
And so if the weights w are small, then because z = wx + b, z will also be relatively small. In particular, if z ends up taking relatively small values, g(z) will be roughly linear, so every layer behaves roughly linearly, like linear regression, and the whole thing is just like a linear network. Even a very deep network with a linear activation function can in the end only compute a linear function, which makes it impossible to fit very complicated decision boundaries.
If you have a neural net fitting some very complicated decision boundaries, you could possibly overfit, and this effect could definitely help reduce your overfitting.
When you implement gradient descent, one of the steps to debug it is to plot the cost function J as a function of the number of iterations of gradient descent: you want to see that J decreases monotonically after every iteration. If you are implementing regularization, then remember that J now has a new definition. If you plot the old definition of J, you might not see a monotonic decrease. So, to debug gradient descent, make sure you are plotting this new definition of J, which includes the second term as well; otherwise you might not see J decrease monotonically on every single iteration.
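A minimal sketch of computing this new J, assuming a hypothetical unregularized cross-entropy cost and a list of layer weight matrices (all values here are made up for illustration):

```python
import numpy as np

def regularized_cost(cross_entropy_cost, weights, lam, m):
    # The new J: the unregularized cost plus the Frobenius-norm penalty,
    # summed over every layer's weight matrix.
    penalty = (lam / (2 * m)) * sum(np.sum(W ** 2) for W in weights)
    return cross_entropy_cost + penalty

# Toy values: a 2-layer network with hypothetical weights.
weights = [np.ones((2, 2)), np.full((2, 1), 2.0)]
print(regularized_cost(0.5, weights, lam=0.1, m=10))  # 0.5 + 0.005 * 12 = 0.56
```

This is the quantity to plot against the iteration count when debugging gradient descent with regularization.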
I have found regularization pretty helpful in my Deep Learning models and have helped me solve overfitting quite a few times, I hope they can help you too.
Hi everyone, I am Rishit Dagli.
LinkedIn: linkedin.com/in/rishitdagli440113165/
Website: rishit.tech
If you want to ask me some questions, report any mistakes, suggest improvements, or give feedback, you are free to do so by emailing me at
rishit.dagli@gmail.com
hello@rishit.tech
When implementing a neural network, backpropagation is arguably where it is most prone to mistakes. So, how cool would it be if we could implement something that lets us debug our neural nets easily? Here, we will see the method of Gradient Checking. Briefly, this method consists of approximating the gradient numerically; if the approximation is close to the calculated gradients, then backpropagation was implemented correctly. But there is a lot more to this, so let us see. Sometimes, it can be seen that the network gets stuck over a few epochs and then continues to converge quickly. We will also see how we can solve this problem. Let us get started!
All the visualizations and LaTeX code used in the blog are available in the repository Rishitdagli/DebuggingNeuralNets on github.com.
In order to build up to Gradient Checking, we first need to see how we can numerically approximate gradients. I find it easy to explain with an example, so let us take the function f(θ) = θ³. Let us see a graph of this function.
As you might have guessed, a simple graph. Let us, as usual, start off with some value of θ; for now we say θ = 1. Now what we will do is nudge our θ not just to the right, to get f(θ + ε), but also nudge it to the left and get f(θ − ε). For the purpose of the example we will just say ε = 0.01. Let us now visualize this; ignore the scale in this graph, it is just for demonstration purposes.
So now the point at f(θ − ε) is named B, and the point at f(θ + ε) is called C. In this figure, what we earlier used to do was use the triangle DEF and compute its height over width, which in this case is EF/DE. It turns out that you can get a much better estimate of the gradient if you use a bigger triangle spanning from f(θ − ε) to f(θ + ε); to show you what I mean, I refer to the triangle here in red.
There are some papers on why using this bigger triangle gives a better approximation of the gradient at θ, but for the scope of this article I will not go into them in great detail. Simply put, by using the bigger triangle you are now taking the two smaller triangles into account.
So, we just saw why you should use the bigger triangle instead; having done that, let us get to the maths for that triangle. Looking at the figure we can simply say that
H = f(θ − ε) and
F = f(θ + ε)
With these two results you can say that the height h of the bigger triangle is
h = f(θ + ε) − f(θ − ε)
With similar arguments you can also easily calculate that the width w of this triangle is
w = 2ε
So, if you know a bit about derivatives, you can easily infer
g(θ) ≈ (f(θ + ε) − f(θ − ε)) / 2ε
where g(θ) refers to the gradient.
Now let us check how accurate the equation we wrote above is by plugging in the values from the example we just discussed. I would get something like
g(θ) ≈ (f(1.01) − f(0.99)) / 0.02 = (1.030301 − 0.970299) / 0.02 = 3.0001
Let us now calculate the actual derivative. As f(θ) = θ³, by simple derivatives g(θ) = 3θ², and at θ = 1 we get g(1) = 3. We did a pretty good job, with an approximation error of just 0.0001. Now let us see what we would get with the traditional one-sided approach. If you calculate the one-sided difference you end up with 3.0301, an approximation error of 0.0301. So, we did a fantastic job here reducing our approximation error significantly!
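The worked example above can be checked in a few lines of Python, with f, θ, and ε as in the text:

```python
def f(theta):
    # The example function from the text: f(theta) = theta^3.
    return theta ** 3

theta, eps = 1.0, 0.01

# Two-sided (centered) difference versus the traditional one-sided difference.
two_sided = (f(theta + eps) - f(theta - eps)) / (2 * eps)  # ≈ 3.0001
one_sided = (f(theta + eps) - f(theta)) / eps              # ≈ 3.0301
```

The true derivative is g(1) = 3, so the two-sided estimate is off by about 0.0001 while the one-sided one is off by about 0.0301.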
You just saw that the two-sided derivative performed a lot better than the traditional approach, and so it gives you much greater confidence that g(θ) is probably a correct implementation of the derivative of f(θ). This sounds too good to be true, and there is indeed a downside: it runs twice as slow as the one-sided approach. But I believe that in practice it is worth using this method because it is just much more accurate.
Let us take a look back at the formal definition of derivatives.
Here an important thing to note is the limit as ε → 0. For non-zero values of ε you can show that the error of this two-sided approximation is on the order of ε², and ε is a very small number tending to 0. So, to sum up,
error = O(ε²)
Here O refers to the order of the error.
But you can prove by simple mathematics that if you use the one-sided derivative, your error is on the order of ε, or
error = O(ε)
Again, ε is a very small number, of course less than 1, so ε >> ε². So now you probably understand why you should use a double-sided difference rather than a single-sided difference, and we will see how this helps us in gradient checking.
Gradient Checking is a pretty useful technique and has helped me easily debug or find bugs in my neural nets many times. Now we will see how you can use this wonderful technique to verify that your implementation of backpropagation is correct. Not something new, but let us quickly see a diagram of a single neuron as a refresher.
Remember that [x₁, x₂, ..., x_n] go as input, and we have parameters (W[1], b[1]), (W[2], b[2]), ..., (W[L], b[L]) for the layers. So, to implement gradient checking, the first thing you should do is take all your parameters and reshape them into one giant vector. That is, you take all of these W's and b's, reshape each of them into a vector, and then concatenate all of these vectors into one giant vector called θ. So now we can write our cost function J in terms of θ:
J(W[1], b[1], ..., W[L], b[L]) = J(θ)
So now our cost function J is just a function of θ.
Now, with your W's and b's ordered the same way, you can also take dW[1], db[1], ..., dW[L], db[L] and concatenate them into a single big vector dθ of the same dimension as θ. We do so with a similar procedure: we reshape all the dW's into vectors (as they are matrices; the db's are already vectors) and then concatenate everything. A point which I feel could be helpful: the dimensions of W are the same as those of dW, and those of b the same as those of db, so you can use the same sort of reshape and concatenation operations. So, now you might have a question: is dθ the gradient or slope of J with respect to θ here? We will discuss that in a few moments.
Implementing Grad Check
Remember we wrote J as a function of θ; let us now write it as
J(θ) = J(θ₁, θ₂, θ₃, ...)
Having done this, let us go through the process for some component θᵢ. What we do now is calculate the approximate derivative, precisely a partial derivative, of the function J with respect to θᵢ. Note we will be using the earlier discussed two-sided derivative for this. To represent this mathematically:
dθ_approx[i] = (J(θ₁, ..., θᵢ + ε, ...) − J(θ₁, ..., θᵢ − ε, ...)) / 2ε
Now, by the earlier discussion about double-sided derivatives, we can also say that this is approximately the partial derivative of J with respect to θᵢ.
Having this clear, we can now repeat the same process not just for one θᵢ but for all i such that i ∈ (1, n). So a pseudo Python code for this would look like this
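A runnable sketch of that loop might look like this (the function name and default ε are my own choices; `J` is the cost as a function of the flattened parameter vector):

```python
import numpy as np

def dtheta_approx(J, theta, eps=1e-7):
    """Two-sided numerical gradient of cost J at the parameter vector theta,
    computed one component at a time."""
    approx = np.zeros_like(theta)
    for i in range(theta.size):
        plus, minus = theta.copy(), theta.copy()
        plus[i] += eps   # nudge component i to the right
        minus[i] -= eps  # nudge component i to the left
        approx[i] = (J(plus) - J(minus)) / (2 * eps)
    return approx
```

For example, with J(θ) = Σθᵢ² the true gradient is 2θ, and the approximation matches it to well within the O(ε²) error bound.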
So, now we have two vectors, dθ_approx and dθ, and these should be almost equal to each other. But now we have another problem: how can I figure out whether two vectors are approximately equal?
What I would do now is compute the Euclidean distance between the two vectors: take the sum of squares of the elements of their difference, and then take the square root. We then normalize by the lengths of the vectors: add their individual lengths and divide the Euclidean distance of the difference by that sum. If this feels confusing, hang on, an equation will help:
difference = ||dθ_approx − dθ||₂ / (||dθ_approx||₂ + ||dθ||₂)
Note: here I have used the conventional symbol for the L2 norm of a vector. If you are wondering why we normalized this distance, it is just in case one of these vectors is really large or really small.
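That formula translates directly to NumPy (the function name is my own):

```python
import numpy as np

def grad_difference(approx, grad):
    """Normalized Euclidean distance between the approximated
    and the backprop-computed gradient vectors."""
    return np.linalg.norm(approx - grad) / (np.linalg.norm(approx) + np.linalg.norm(grad))
```

Identical vectors give 0, and completely mismatched vectors push the ratio toward 1, which is what makes the thresholds in the next paragraph meaningful.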
While implementing gradient check across a lot of projects, I have observed that a value of ε = 10⁻⁷ does the work most of the time. So, with the above similarity formula and this value of ε: if the formula yields a value smaller than 10⁻⁷, that is wonderful; it means your derivative approximation is very likely correct. If it is around 10⁻⁵, I would say it is okay, but I would double-check the components of my vectors and verify that none of them are too large; if some components of the difference are very large, maybe you have a bug. If it gives you around 10⁻³, I would be very concerned; there is probably a bug somewhere. In case you get values bigger than this, something is really wrong, and you should look at the individual components of θ. Doing so, you might find a value of dθ[i] that is very different from dθ_approx[i] and use that to track down which of your derivatives is incorrect.
I made a table for you to refer to while building your ML applications.
Reference table
In practice, when implementing a neural network, what often happens is that you implement forward propagation and backward propagation, and then you find that grad check gives a relatively big value. You then suspect there must be a bug, go in, and debug. After debugging for a while, if you find that it passes grad check with a small value, you can be much more confident that it is correct (and also get a sense of relief :) ). This particular method has often helped me find bugs in my neural networks too, and I would advise you to use it while debugging your nets as well.
Running grad check in every epoch will make training very slow, so use it only while debugging.
Looking at the components would help a lot if your algorithm fails grad check. Sometimes it may also give you hints on where the bug might be.
If your J or cost function has an additional regularization term, then you should also calculate the derivatives of that term and add them.
What dropout does is randomly remove some neurons, which makes it hard to define a cost function on which to perform gradient descent. It turns out that dropout can be viewed as optimizing some cost function J, but it is a cost function defined by summing over all the exponentially many subsets of nodes that could be eliminated in any iteration. So this cost function J is very difficult to compute, and you are just sampling it every time you eliminate a different random subset of nodes. So it is difficult to use grad check to double-check your computation with dropout. What I usually do is implement grad check without dropout: if you want, you can set keep_prob to 1.0, verify the gradients, and then turn on dropout and hope that the implementation of dropout was correct.
You could also do this with some pretty elegant Taylor expansions but we will not go into that here.
And that was it: you just saw how you can easily debug your neural networks and find problems in them. I hope gradient checking will help you find problems or debug your nets just like it helped me.
One of the problems of training neural networks, especially very deep ones, is vanishing or exploding gradients. What that means is that when you are training a very deep network, your derivatives or slopes can sometimes get either very, very big or very, very small, maybe even exponentially small, and this makes training a lot more difficult; it can even take much longer to reach convergence. We will see what this problem of exploding or vanishing gradients really means, as well as how a careful choice of the random weight initialization can significantly reduce it.
Let us say you are training a very deep neural network like this. I have shown an image of it as if you only had a few hidden units per layer and a couple of hidden layers, but it could be more as well.
So, following conventional standards, we will call the weights of layer L[1] w[1], the weights of L[2] w[2], and similarly all the way up to the weights of layer L[n], w[n]. Further, to keep things simple at this stage, let us say we are using a linear activation function such that g(z) = z. For now, let us also ignore the biases b and say that b[l] = 0 for all l ∈ (1, n).
In this case we can simply write z[1] = w[1] x. Now let us make some observations here: as b = 0, we get a[1] = g(z[1]) = z[1] = w[1] x. So the expression for a[2] is a[2] = w[2] a[1]. With this and the earlier equation we get a[2] = w[2] w[1] x; continuing, a[3] = w[3] w[2] w[1] x, and so on.
Exploding Gradients
Let's now say that all our w[l] are just bigger than the identity. As our w's are matrices, it is better to put this as "just bigger than the identity matrix", except for w[L], which may have different dimensions than the rest. Let the matrix we just defined, something a little greater than the identity, be C = 1.2 I, where I is a 2 × 2 identity matrix. With this we can easily simplify the above expression to
ŷ ≈ w[L] · (1.2 I)^(L−1) · x
If L is large, as happens for very deep neural networks, ŷ will be very large. In fact, it grows exponentially, like 1.2 raised to the number of layers. So if you have a very deep network, the value of ŷ will just explode.
Vanishing Gradients
Now, conversely, if we replace the 1.2 with 0.5, something less than 1, this makes C = 0.5 I, so the factor becomes 0.5 raised to the number of layers. With this, the activation values will decrease exponentially as a function of the number of layers L of the network. So in a very deep network, the activations end up decreasing exponentially.
So the intuition I hope you take away from this is that if the weights w are all just a little bit bigger than one, or just a little bit bigger than the identity matrix, then with a very deep network the activations can explode; and if w is just a little bit less than the identity, with a very deep network the activations will decrease exponentially. And even though I went through this argument in terms of activations increasing or decreasing exponentially as a function of L, a similar argument shows that the derivatives or gradients you compute will also increase or decrease exponentially with the number of layers.
In today's era, a lot of tasks might require a deep neural network with maybe 152 or 256 layers; in that case these values can get really big or really small, which impacts your training drastically and makes it a lot more difficult. Just consider this scenario: if your gradients are exponentially small in L, gradient descent will take tiny little steps and take a long time to learn anything or to converge. You just saw how deep networks suffer from vanishing or exploding gradients. Let us now see how we can mitigate this problem with careful initialization of the weights.
It turns out that a solution which helps a lot is a more careful choice of the random initialization for your neural network. Let us understand this with the help of a single neuron first, something like this.
So, you have your inputs x₁ to x₄ and then some function a = g(z), which outputs your ŷ. Pretty much the case with any neuron. Later, when we generalize this, the x's will be replaced by some layer activation a[l], but for now we will stick to x. Not something new, but let us just write the formula to clear our understanding; note that for now we are ignoring b by setting b = 0:
z = w₁x₁ + w₂x₂ + ... + w_n x_n
So, in order for z not to blow up or become too small, notice that the larger n is, the smaller you want your wᵢ to be, since z is the sum of the terms wᵢxᵢ, where n is the number of input features going into the neuron (in this example, 4). When you are adding up a lot of these terms, that is, a large n, you want each term to be smaller. A reasonable thing to do here is to set the variance of wᵢ (written, as is conventional, var(wᵢ)) to be the reciprocal of n, which simply means var(wᵢ) = 1/n. So, what you would do in practice is
w[l] = np.random.randn(shape) * np.sqrt(1 / n[l-1])
Please note that shape here refers to the shape of the weight matrix.
Here n[l−1] is the number of inputs that each neuron in layer l receives. If you are using the most common activation function, ReLU, it has been observed that instead of setting the variance to 1/n you should set it to 2/n. So let us summarise it:
w[l] = np.random.randn(shape) * np.sqrt(2 / n[l-1])    (2)
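A minimal sketch of this (He-style) initialization in NumPy, where the helper name and the use of a seeded generator are my own choices:

```python
import numpy as np

def he_init(n_out, n_in, seed=0):
    """He initialization for a ReLU layer:
    Gaussian weights scaled so that var(w) = 2 / n_in."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(2.0 / n_in)
```

With n_in = 1000, the empirical variance of the drawn weights comes out very close to 2/1000 = 0.002, as intended.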
So, if you are familiar with random variables, you can see how this all comes together: taking a standard Gaussian random variable and multiplying it by the square root of 2/n[l−1] gives a variance equal to 2/n[l−1]. And if you noticed, we earlier were talking in terms of n and have now moved to n[l−1] to generalise to any layer l, since layer l has n[l−1] input features for each of its units or neurons.
And since the inputs are on a standard scale with variance about 1, this causes z to be on a similar scale too. With this, your w[l] is not too much bigger than 1 and not too much smaller than 1, which definitely helps your gradients not to vanish or explode too quickly.
What we just did assumed a ReLU or a linear activation function. So let us see what happens if you are using some other activation function.
There is a wonderful paper showing that when you are using the hyperbolic tangent, or simply tanh, it is a better idea to multiply by the square root of 1/n[l−1] instead of 2/n[l−1]. What I mean is you would have an equation something like this:
w[l] = np.random.randn(shape) * np.sqrt(1 / n[l-1])
However, we will not prove this result here. This is more commonly referred to as Xavier initialization or Glorot initialization, after the names of its discoverers.
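Analogously, a sketch of this tanh (Xavier-style) variant, again with my own helper name:

```python
import numpy as np

def xavier_init(n_out, n_in, seed=0):
    """Xavier/Glorot-style initialization for tanh layers:
    Gaussian weights scaled so that var(w) = 1 / n_in."""
    rng = np.random.default_rng(seed)
    return rng.standard_normal((n_out, n_in)) * np.sqrt(1.0 / n_in)
```

The only difference from the ReLU formula is the numerator: 1 instead of 2, so the empirical variance lands near 1/n_in.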
There has also been a variant of this, again with theoretical justification, which can be seen in a few papers: use the square root of 2/(n[l−1] + n[l]) instead. However, I would recommend using Formula (2) when you are using ReLU, the most common activation function; if you are using tanh, you could try both out.
There is one more idea that might help you make your initialization even better. Consider Formula (2) and multiply it by an additional tuning parameter. This gives you another hyperparameter to tune, but beware that it may not have a very huge or drastic effect on results, so it is not a hyperparameter you would want to tune first. However, it may give you minor improvements.
You can easily implement these initializers with most deep learning frameworks; here I will show how you can do this with TensorFlow. We will be using TensorFlow 2.x for the purpose of the example.
import tensorflow as tf
tf.keras.initializers.glorot_normal(seed=None)
It draws samples from a truncated normal distribution centered on 0 with stddev = sqrt(2 / (fan_in + fan_out))
where fan_in
is the number of input units in the weight tensor and fan_out
is the number of output units in the weight tensor.
import tensorflow as tf
tf.keras.initializers.glorot_uniform(seed=None)
It draws samples from a uniform distribution within [−limit, limit], where limit is sqrt(6 / (fan_in + fan_out))
where fan_in
is the number of input units in the weight tensor and fan_out
is the number of output units in the weight tensor.
Both of these code snippets return an initializer object that you can pass directly to a layer, so you can see how easy it is to do this with TensorFlow.
I hope this showed you some idea about the problem of vanishing or exploding gradients as well as choosing a reasonable scaling for how you initialize the weights. This would make your weights not explode too quickly and not decay to zero too quickly, so you can train a sufficiently deep network without the weights or the gradients exploding or vanishing too much. When you train deep networks, this is another trick that will help you make your neural networks train much more quickly.
These tips were formulated by me while participating in the 10 Days of ML Challenge, a wonderful initiative by TensorFlow User Group Mumbai to encourage people to learn and practice more ML (and not necessarily TensorFlow). The flow was such that you would be given a task at the start of each day, and you were expected to complete the task and tweet about your results. Tasks could be anything from preprocessing data to building a model to deploying your model. I would advise you to go over the tasks of the challenge and try them out yourself; it does not matter if you are a beginner or a practitioner. I will share my major takeaways from these tasks. All my solutions and the challenge problem statements are open-sourced in the repository Rishitdagli/10DaysofML on github.com, where all the code was pushed every day at around 7.
The best part of the challenge was that it was not a core competition but an initiative to promote learning, which meant it was completely okay if you did not know a concept or had bugs you were unable to fix: you could ask the wonderful community and get support. This played a great role for beginners in particular. Even better, you would also be given relevant links to study the topics required to complete each challenge.
All my outputs for the same are available in this thread
Day 1
We were asked to build some interactive charts for the COVID-19 dataset available here. Primarily, you had to build a
Country Wise graph
Date Wise graph
Continent Wise graph
A major problem here was that the dataset did not have latitude or longitude data, which you could otherwise just have passed to a function to create a continent-wise graph. So what I did was get another country-to-continent dataset from Kaggle and add a column called Continent to my original dataset, and that solved it. With so much data, I saw some people making huge, complex graphs that defeated the purpose of graphs, which is simplifying things and making analysis easy; you could no longer observe any patterns from them. Here, for example, a simple scatter plot showed me that the rate of growth in some countries looked like the textbook exponential graph, so I replaced the feature x₁ by ln(x₁) and it was almost linear!
Make simpler, more understandable graphs through which you can analyse data and eyeball a few patterns, instead of complex graphs which do not make much sense and defeat the purpose.
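Here is a small synthetic demonstration of why the ln(x₁) transform helps; the growth numbers below are made up for illustration:

```python
import numpy as np

# Hypothetical exponential-growth feature, like case counts over days.
days = np.arange(1, 11).astype(float)
cases = 100.0 * np.exp(0.3 * days)

# Replacing x1 by ln(x1) turns exponential growth into a straight line.
log_cases = np.log(cases)

# Fit a line; for true exponential data the residuals are essentially zero.
slope, intercept = np.polyfit(days, log_cases, 1)
max_residual = np.max(np.abs(log_cases - (slope * days + intercept)))
```

The fitted slope recovers the growth rate (0.3 here), which is exactly the kind of pattern a simple scatter plot of the transformed feature makes visible.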
Day 2
Day 2 was all about feature engineering and data preprocessing on the Titanic dataset available here. Titanic is a pretty good candidate for this as we get to see a lot of variation and low correlation features in the Titanic dataset, making it very important to employ good feature engineering and preprocessing strategies. Further you could also do some data visualization.
Feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.
You could use a lot of methods in this dataset itself like categorical conversion, bucketizing columns, one hot encoding and many more. Try them out yourself.
Day 3
We finally came to the part of building models: we had to predict loan status and cereal ratings, and, for advanced users, there was an NLP task to analyse the toxicity of comments. The first two tasks were quite simple and targeted at beginners; working with the toxicity data was wonderful. A simple but powerful observation from the NLP task was that it is best to choose the number of RNN/LSTM/GRU units as a multiple of the vocabulary size.
Try to choose the number of units of your recurrent layers as a multiple of the vocabulary size, and try to choose your embedding dimensions by a preferred similarity measure. Also, do not simply stockpile recurrent layers; instead, try to improve your overall model architecture. Choosing the embedding dimension around the fourth root of the number of features tends to work really well.
Day 4
We finally came to unsupervised learning on the Expedia dataset available here. The dataset was pretty huge, over 2 GB, so some precautions had to be taken while working with it in a constrained environment. For example, you should clear your intermediate tensors and variables, which you can do easily using
del variable_name
You should also try to load your data using parallel-processing techniques; when applying preprocessing, run it in a map function to make the best use of your CPU cores. In my approach I used the cloud for certain costly operations like PCA or LDA, and you should try the same. You should also not jump straight to algorithms like Restricted Boltzmann Machines: in this case there was a difference of ~0.8% accuracy between K-Means and Boltzmann machines, and it makes more sense to use K-Means and keep your algorithm simple rather than chase a bit of accuracy.
Always load your data and perform preprocessing on it in parallel, in a map function. Clear your intermediate tensors and variables whenever possible to free up memory for better performance. And always prefer simplicity over small gains in accuracy.
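A minimal sketch of the map-based parallel preprocessing plus `del` pattern, using only the standard library; the preprocessing step here is hypothetical:

```python
from multiprocessing.dummy import Pool  # thread-backed pool from the stdlib

def preprocess(x):
    # Hypothetical per-record preprocessing step, e.g. scaling [0, 1] to [-1, 1].
    return (x - 0.5) * 2

raw = [0.0, 0.25, 0.5, 0.75, 1.0]
with Pool(4) as pool:
    # Records are preprocessed concurrently across the pool's workers.
    processed = pool.map(preprocess, raw)

del raw  # free the intermediate once it is no longer needed
```

The same pattern carries over to framework-level pipelines (e.g. mapping a preprocessing function over a dataset with parallel calls) when the data no longer fits comfortably in memory.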
Day 5
We were asked to build models for the famous Dogs vs. Cats dataset available here. The idea was to use convolutional neural networks, or CNNs, to figure out the features in the images. I used two approaches for the task: one with transfer learning from the Inception model, freezing some layers and adding some dense layers below it. I trained only the layers I added below the Inception model and ended up with good accuracy. For the other approach, building the network from scratch, I used a couple of convolutional and pooling layers, which did the job. To prevent overfitting I also used regularization, dropout layers, and image augmentation, all made easy with TensorFlow. Both approaches are present in the repo.
You should always try to use regularization, dropout layers, and image augmentation to get over overfitting. Also try to use odd kernel sizes like 3x3 or 5x5. Do not just keep increasing your number of channels; you might get better training accuracy but overfit the data. Always start with smaller filters to collect as much local information as possible, and then gradually increase the filter width to reduce the generated feature-space width and represent more global, high-level, representative information.
Day 6
Day 6 had two tasks for us: making a model for Fashion MNIST and one for X-ray images to predict pneumonia. Again, this was an image problem and we were expected to use CNNs.
If the dataset is not very complex, like Fashion MNIST, do not waste your time building a huge model architecture with 10 layers or so.
The XRay pneumonia problem was a bit tricky and we were also expected to do a bit of feature engineering with the data. I found out some very helpful experimental results while working on this problem which I believe one should follow.
Batch normalization should generally be placed after the layer containing the activation function and before the dropout layer (if any). An exception is the sigmoid activation, where you should place batch normalization before the activation, to ensure that the values lie in the linear region of the sigmoid before the function is applied.
Keep the feature space wide and shallow in the initial stages of the network, and then make it narrower and deeper towards the end. (Very important)
Place your dropout layers after max-pooling layers. If placed before the max-pool layer, the values removed by dropout may not affect its output, since max-pool picks the maximum from a set of values; only when the maximum value itself is removed can it be thought of as removing a feature dependency.
Day 7
This day was all about NLP, and again we had two challenges: analysing IMDB reviews, a very famous dataset, and analysing the sentiment of comments in the Twitter dataset.
Every LSTM layer should be accompanied by a Dropout layer. This layer will help to prevent overfitting by ignoring randomly selected neurons during training, and hence reduces the sensitivity to the specific weights of individual neurons. 20% is often used as a good compromise between retaining model accuracy and preventing overfitting.
The main challenge here was selecting the number of LSTM/GRU units. I used K-Fold cross-validation to do so; if you do not want to dive deep into its workings, you might want to use this simplified result:
N_h = N_s / (α · (N_i + N_o))
Here N_i is the number of input neurons, N_o the number of output neurons, N_s the number of samples in the training data, and α represents a scaling factor, usually between 2 and 10. You can think of α as the effective branching factor, or the number of non-zero weights per neuron. I personally advise you to set α between 5 and 10 for better performance.
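That rule of thumb is easy to wrap in a helper (a sketch; the function name and the use of integer division are my own choices):

```python
def hidden_units(n_samples, n_in, n_out, alpha=5):
    """Rule-of-thumb size for a hidden layer: Ns / (alpha * (Ni + No)),
    where alpha is a scaling factor usually between 2 and 10."""
    return n_samples // (alpha * (n_in + n_out))
```

For example, with 10,000 training samples, 8 inputs, 2 outputs, and α = 5, the heuristic suggests around 200 hidden units; doubling α to 10 halves that to around 100.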
Here is a wonderful visualization I created with Embedding projector for the same
Day 8
Clustering documents, or clustering the text of the consumer-complaints data available here, was the task for the 8th day: classifying complaints into different categories. I tried many different algorithms, like K-Means, hierarchical, fuzzy, and density-based clustering, as well as restricted Boltzmann machines, and K-Means again gave me the best results. I used the elbow method and the silhouette curve to find the best number of clusters. I would again take this example to show that one should choose simplicity over a bit of performance: affinity propagation in this case gave me the best Fowlkes–Mallows score, which was just a bit higher than that of K-Means, so I ended up using K-Means instead.
In the tradeoff between model explainability and performance or accuracy, you should prefer explainability in cases of small differences.
Day 9 and 10
These days were about TensorFlow.js: we had to deploy models which ran in the browser itself. If you do not know much about TF.js, you can read my blog on getting started with it, "Getting started with Deep Learning in browser with TF.js", on medium.com.
And we had to build a project of our choice with TF.js. In these days I built two projects: a real-time text sentiment analyzer, and a web app which could determine your pose, built on top of the Posenet model. Browser-based inference or training has to be done with precaution: you do not want your process to block the browser or fill it with intermediate variables. A good idea here is to use something like Firebase to host a trained model and call it from your web app. There is indeed a lot of preprocessing to be done for real-time inference. You should also not block any threads; run your processes asynchronously so the UI is not blocked while your model is being loaded and made ready.
I have seen a few people writing all of their JS code between two script tags in the HTML itself, or keeping all their code in a single index.js file, so there is only one file with all the code. That is a very bad approach from a developer's point of view: you should try to separate your code, with one file to get the data, another to load the model, another to preprocess, and so on. This makes it much easier for someone else to understand your code.
It is almost mandatory to use the wonderful tf.tidy, which keeps your runtime free of the intermediate tensors and other variables you might have created, giving you the best memory utilization and preventing memory leaks. Run your processes asynchronously: you can simply use the keyword async while defining the function and await while calling it; this will not freeze your UI thread and is pretty helpful. I would also recommend shrinking the model size by applying weight quantization: by quantizing the model weights, we can reduce the size of our model to a fourth of the original size.
The results for the challenge came out on 9 April, 2020 and I was among the winners 😃. To my delight I was also promoted to a mentor for the group for amazing engagement.
Thanks to Sayak Paul, Ali Mustufa Shaikh, Shubham Sah, Mitusha Arya and Smit Jethwa for their mentorship during the program.
Hi everyone I am Rishit Dagli
If you want to ask me some questions, report any mistake, suggest improvements, give feedback you are free to do so by emailing me at
Data science is quite an exciting field: you are trying to figure out patterns in your data, make some cool plots from it, and gather insights from it. If you do not know much about data science, simply understand it as a field covering data engineering, data analysis, machine learning, visualization, and many more. Just take an example: I have a dataset which is huge, let's say around 50 GB; it would take me almost 5 minutes to load this kind of data. What if I made some error in my code and have to execute it again? I wait for 5 minutes again, so it makes sense to use notebooks here. I can execute a code cell, see the output, and then run another code cell. To do this in Python or R we use Jupyter Notebooks. We will further see the tools Kotlin is equipped with to make data science a lot easier.
Here instead of an integrated development environments (IDE) that can handle all the development tasks in a single tool we would use similar tools called notebooks as discussed earlier. Notebooks let users conduct research and store it in a single environment. In a notebook, you can write narrative text next to a code block, execute the code block, and see the results in any format that you need: output text, tables, data visualization, and so on. Kotlin provides very easy integration with two popular notebooks: Jupyter and Apache Zeppelin, which both allow you to write and run Kotlin code blocks making the process a lot easier. In this blog we would not be covering how you could use Apache Zeppelin but you can check that out for yourself here.
Let us now see how you would set up a Jupyter kernel for development with Kotlin. For code execution, Jupyter uses the concept of kernels: components that run separately and execute the code upon request, for example, when you click Run in a notebook.
There is a kernel that the Jupyter team maintains themselves, IPython, for running Python code. However, there are other community-maintained kernels for different languages. Among them is the Kotlin kernel for Jupyter notebooks. With this kernel, you can write and run Kotlin code in Jupyter notebooks and use third-party data science frameworks written in Java and Kotlin.
Before setting up the kernel, I expect you to have Java 8 and Conda and/or pip installed on your system.
conda install -c jetbrains kotlin-jupyter-kernel
Run the above command to install the latest stable release through conda
pip install kotlin-jupyter-kernel
You could also install it directly from its GitHub repo too
git clone https://github.com/Kotlin/kotlin-jupyter.git
cd kotlin-jupyter
./gradlew install
Once the kernel is installed you can use any of these commands:
jupyter console --kernel=kotlin
jupyter notebook
jupyter lab
Once in, to start using the Kotlin kernel inside Jupyter Notebook or JupyterLab, create a new notebook with the kotlin kernel.
Fortunately, there are already plenty of frameworks written in Kotlin for data science. There are even more frameworks written in Java, which is perfect as they can be called from Kotlin code seamlessly.
Below are two short lists of libraries that you may find useful for data science.
Kotlin Libraries
kotlin-statistics is a library providing extension functions for exploratory and production statistics. It supports basic numeric list/sequence/array functions (from sum to skewness), slicing operators (such as countBy and simpleRegressionBy), binning operations, discrete PDF sampling, a naive Bayes classifier, clustering, linear regression, and much more.
kmath is a library inspired by NumPy. This is by far my favorite library and, in my opinion, the most useful. It supports algebraic structures and operations, array-like structures, math expressions, histograms, streaming operations, a wrapper around commons-math and koma, and more.
krangl is a library inspired by Python's pandas. It provides functionality for data manipulation using a functional-style API; it also includes functions for filtering, transforming, aggregating, and reshaping tabular data. This creates an entry point for you to start with data science and AI, allowing you to easily load tabular data.
lets-plot is a plotting library for statistical data written in Kotlin. Lets-Plot is multi-platform and can be used not only with the JVM, but also with JS and Python. Think of it as a cross-platform Matplotlib.
Java Libraries
Kotlin provides very easy and seamless interoperability with Java, which means you can use all of Java's libraries too! There are a lot of Java libraries which can help you plot graphs, do mathematical calculations, apply deep learning and a lot more; you can find them all here.
It would be best now to see some ways to use two indispensable libraries, namely Lets-Plot and NumPy. Let's get started with Lets-Plot.
Lets-Plot for Kotlin is a Kotlin API for the Lets-Plot library, an open-source plotting library for statistical data written entirely in Kotlin. Lets-Plot was built on the concept of layered graphics implemented in the ggplot2 package for R. Lets-Plot for Kotlin is tightly integrated with the Kotlin kernel for Jupyter notebooks. Once you have the Kotlin kernel installed and enabled, add the following line to a Jupyter notebook:
%use letsplot
And that's it, you can now use all of Lets-Plot's functions and create beautiful graphs.
**KNumpy (Kotlin Bindings for NumPy)** is a Kotlin library that enables calling NumPy functions from Kotlin code. NumPy is a popular package for scientific computing with Python. It provides powerful capabilities for multidimensional array processing, linear algebra, Fourier transform, random numbers, and other mathematical tasks.
KNumpy provides statically typed wrappers for NumPy functions. Thanks to the functional capabilities of Kotlin, the API of KNumpy is very similar to the one for NumPy. This lets developers that are experienced with NumPy easily switch to KNumpy. Here are two equal code samples in Python and Kotlin.
# Python
import numpy as np
a = np.arange(15).reshape(3, 5)
print(a.shape == (3, 5)) # True
print(a.ndim == 2) # True
print(a.dtype.name) # 'int64'
b = (np.arange(15) ** 2).reshape(3, 5)
// Kotlin
import org.jetbrains.numkt.*
fun main() {
val a = arange(15).reshape(3, 5)
println(a.shape.contentEquals(intArrayOf(3, 5))) // true
println(a.ndim == 2) // true
println(a.dtype) // class java.lang.Integer
// create an array of ints, square each element, and reshape to (3, 5)
val b = (arange(15) `**` 2).reshape(3, 5)
}
An important thing to note is that, unlike Python, Kotlin is a statically typed language. This lets you avoid entire classes of runtime errors: with KNumpy, the Kotlin compiler detects them at earlier stages.
And that was just the very basics of what Kotlin can do in the field of data science and AI; we have still not seen building neural networks in Kotlin and may cover that in some other post. But you just saw the power of Kotlin and how it extends to data science, along with the vast array of tools and libraries that make your work a lot easier.
Hi everyone I am Rishit Dagli
LinkedIn linkedin.com/in/rishitdagli440113165/
Website rishit.tech
If you want to ask me some questions, report any mistake, suggest improvements, give feedback you are free to do so via the chat box on the website or by mailing me at
If you are a beginner in Machine Learning and want to get started with developing models in the browser, this is for you; if you already know about the TensorFlow Python wrapper and want to explore TF.js, you have come to the right place! Unlike traditional blogs, I believe in hands-on learning and that is what we will do here. I would advise you to first read the prequel here.
All the code is available here: Rishit-dagli/Get-started-with-TF.js, a simple repository to get started with TF.js and start making simple models (github.com).
We will continue with what we did in the earlier blog. In the last blog we made our model and also preprocessed the data; let us now fit the model and predict with it.
With our model instance created and our data represented as tensors, we have everything in place to start the training process. Add this code to your script.js:
async function trainModel(model, inputs, labels) {
// Prepare the model for training.
model.compile({
optimizer: tf.train.adam(),
loss: tf.losses.meanSquaredError,
metrics: ['mse'],
});
const batchSize = 32;
const epochs = 50;
return await model.fit(inputs, labels, {
batchSize,
epochs,
shuffle: true,
callbacks: tfvis.show.fitCallbacks(
{ name: 'Training Performance' },
['loss', 'mse'],
{ height: 200, callbacks: ['onEpochEnd'] }
)
});
}
Let us now break this down.
model.compile({
optimizer: tf.train.adam(),
loss: tf.losses.meanSquaredError,
metrics: ['mse'],
});
We have to compile the model before we train it. To do so, we have to specify a number of very important things:
optimizer
: This is the algorithm that is going to govern the updates to the model as it sees examples. There are many optimizers available in TensorFlow.js. Here we have picked the adam optimizer as it is quite effective in practice and requires no configuration.
loss
: this is a function that will tell the model how well it is doing on learning each of the batches (data subsets) that it is shown. Here we use meanSquaredError
to compare the predictions made by the model with the true values.
You can read more about them here.
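The meanSquaredError loss boils down to a one-line formula: average the squared differences between predictions and targets. A plain-Python sketch, just for intuition (TF.js of course computes this on tensors):

```python
def mean_squared_error(y_true, y_pred):
    """Average of squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

# Two perfect predictions and one that is off by 2: (0 + 0 + 4) / 3
print(mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))  # → 1.333...
```

Because the errors are squared, a few large mistakes dominate the loss, which pushes the model to avoid big misses first.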
const batchSize = 32;
const epochs = 50;
Now we pick a batch size and a number of epochs; these are hyperparameters you can tune.
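Batching just means slicing the dataset into chunks of batchSize and doing one weight update per chunk, repeated for the chosen number of epochs. A plain-Python sketch of that loop structure (names are illustrative, not TF.js API):

```python
def iterate_batches(data, batch_size):
    """Split a dataset into consecutive batches of at most batch_size items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

data = list(range(100))  # stand-in for 100 training examples
epochs = 3
for epoch in range(epochs):
    for batch in iterate_batches(data, 32):
        pass  # one optimizer update per batch would happen here

# 100 examples with batch_size=32 → a final partial batch of 4
print([len(b) for b in iterate_batches(data, 32)])  # → [32, 32, 32, 4]
```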
return await model.fit(inputs, labels, {
batchSize,
epochs,
callbacks: tfvis.show.fitCallbacks(
{ name: 'Training Performance' },
['loss', 'mse'],
{ height: 200, callbacks: ['onEpochEnd'] }
)
});
model.fit
is the function we call to start the training loop. It is an asynchronous function so we return the promise it gives us so that the caller can determine when training is complete. To monitor training progress we pass some callbacks to model.fit
. We use tfvis.show.fitCallbacks
to generate functions that plot charts for the 'loss' and 'mse' metrics we specified earlier.
You just made a model and have all the required functions to train and fit it; what's more, you will now call these functions.
const tensorData = convertToTensor(data);
const {inputs, labels} = tensorData;
// Train the model
await trainModel(model, inputs, labels);
console.log('Done Training');
When you add this code to the run
function you are essentially calling the model fit function. When you refresh the page now you will be able to see a training graph.
These are created by the callbacks we created earlier. They display the loss and mse, averaged over the whole dataset, at the end of each epoch. When training a model we want to see the loss go down. In this case, because our metric is a measure of error, we would want to see it go down as well.
Now we need to make predictions through our model, let us do this. Add this to your script.js
to make predictions.
function testModel(model, inputData, normalizationData) {
const {inputMax, inputMin, labelMin, labelMax} = normalizationData;
// Generate predictions for a uniform range of numbers between 0 and 1;
// We unnormalize the data by doing the inverse of the minmax scaling
// that we did earlier.
const [xs, preds] = tf.tidy(() => {
const xs = tf.linspace(0, 1, 100);
const preds = model.predict(xs.reshape([100, 1]));
const unNormXs = xs
.mul(inputMax.sub(inputMin))
.add(inputMin);
const unNormPreds = preds
.mul(labelMax.sub(labelMin))
.add(labelMin);
// Unnormalize the data
return [unNormXs.dataSync(), unNormPreds.dataSync()];
});
const predictedPoints = Array.from(xs).map((val, i) => {
return {x: val, y: preds[i]}
});
const originalPoints = inputData.map(d => ({
x: d.horsepower, y: d.mpg,
}));
tfvis.render.scatterplot(
{name: 'Model Predictions vs Original Data'},
{values: [originalPoints, predictedPoints], series: ['original', 'predicted']},
{
xLabel: 'Horsepower',
yLabel: 'MPG',
height: 300
}
);
}
Let's break this down.
const xs = tf.linspace(0, 1, 100);
const preds = model.predict(xs.reshape([100, 1]));
We generate 100 new examples to feed to the model. model.predict is how we feed those examples into the model. Note that they need to have the same shape ([num_examples, num_features_per_example]) as when we did the training.
const unNormXs = xs
.mul(inputMax.sub(inputMin))
.add(inputMin);
const unNormPreds = preds
.mul(labelMax.sub(labelMin))
.add(labelMin);
This piece of code un-normalizes the data, mapping it back from the 0-1 range to the original units.
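Un-normalizing is simply the inverse of the min-max scaling applied during preprocessing. A plain-Python sketch with made-up bounds (the JS above does the same thing with tensor ops):

```python
def unnormalize(values, v_min, v_max):
    """Invert min-max scaling: map values in [0, 1] back to [v_min, v_max]."""
    return [v * (v_max - v_min) + v_min for v in values]

# The model works in the 0-1 range; predictions are mapped back to MPG units.
print(unnormalize([0.0, 0.5, 1.0], 10.0, 40.0))  # → [10.0, 25.0, 40.0]
```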
Let's now see this in action; add this to your run function:
testModel(model, data, tensorData);
You will get similar results
We created an intermediate model with just 2 layers; tweak the hyperparameters to see what more you can do. You just worked through the basics of creating a simple model using TF.js completely in the browser and created some wonderful visualizations.
If you are a beginner in Machine Learning and want to get started with developing models in the browser, this is for you; if you already know about the TensorFlow Python wrapper and want to explore TF.js, you have come to the right place! Unlike traditional blogs, I believe in hands-on learning and that is what we will do here.
All the code is available here: Rishit-dagli/Get-started-with-TF.js, a simple repository to get started with TF.js and start making simple models (github.com).
With TF.js you can use off-the-shelf JavaScript models or convert Python TensorFlow models to run in the browser or under Node.js very easily and still have all the capability.
Retrain pre-existing ML models using your own data. If you think this sounds like transfer learning, yes it is, but in the browser itself.
And finally here is the best part you can build and train models directly in JavaScript using flexible and intuitive APIs just like you would do in Python with some different keywords of course.
TF.js supports development with almost any library. And before you ask: in February 2020, the TensorFlow team announced support for React Native too, and this can get you anywhere.
These features are just a summary of TF.js and make it indispensable.
All you need to follow along is the latest version of Chrome and a text editor of your choice. That is it.
We will start by creating a simple HTML page and adding JavaScript to it.
index.html
<!DOCTYPE html>
<html>
<head>
<title>TensorFlow.js Tutorial</title>
<!-- Import TensorFlow.js -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjs@1.0.0/dist/tf.min.js"></script>
<!-- Import tfjs-vis -->
<script src="https://cdn.jsdelivr.net/npm/@tensorflow/tfjsvis@1.0.2/dist/tfjsvis.umd.min.js"></script>
<!-- Import the main script file -->
<script src="script.js"></script>
</head>
<body>
</body>
</html>
script.js
console.log("Hello Home");
Run the index.html
file in the web browser. If you use Chrome, press Ctrl + Shift + I to open the developer tools; for other browsers, find the developer tools option. In the console you can now see Hello Home and two runtime variables, tf and tfvis.
We will now create a model and work with it. We will be using the cars data set for this tutorial. It contains many different features about each given car; we want to extract only the data about Horsepower and Miles per Gallon. So, we will add this code to script.js:
/**
* Get the car data reduced to just the variables we are interested
* and cleaned of missing data.
*/
async function getData() {
const carsDataReq = await fetch('https://storage.googleapis.com/tfjstutorials/carsData.json');
const carsData = await carsDataReq.json();
const cleaned = carsData.map(car => ({
mpg: car.Miles_per_Gallon,
horsepower: car.Horsepower,
}))
.filter(car => (car.mpg != null && car.horsepower != null));
return cleaned;
}
Let's now create a visualization, a scatter plot of the data, with this code added to script.js:
async function run() {
// Load and plot the original input data that we are going to train on.
const data = await getData();
const values = data.map(d => ({
x: d.horsepower,
y: d.mpg,
}));
tfvis.render.scatterplot(
{name: 'Horsepower v MPG'},
{values},
{
xLabel: 'Horsepower',
yLabel: 'MPG',
height: 300
}
);
// More code will be added below
}
document.addEventListener('DOMContentLoaded', run);
When you refresh the page now, you would be able to see a cool scatter plot of the data
Scatter plot of the data
This panel is known as the visor and is provided by tfjs-vis. It provides an easy and wonderful place to display visualizations. Our goal is to train a model that will take one number, Horsepower, and learn to predict Miles per Gallon. Let's now see how we will do this; I expect you to know the basics of supervised learning and neural networks.
This is the most interesting section, where you will be trying and experimenting for yourself. Let's now go about this. If you do not know about neural nets: a neural net is an algorithm with a set of layers of neurons, with weights governing their output. The training process learns the ideal values for those weights to get the most accurate output. Let's see how we can make a very simple model.
function createModel() {
// Create a sequential model
const model = tf.sequential();
// Add a single input layer
model.add(tf.layers.dense({inputShape: [1], units: 1, useBias: true}));
// Add an output layer
model.add(tf.layers.dense({units: 1, useBias: true}));
return model;
}
We will now break this down and understand each bit of it.
const model = tf.sequential();
This instantiates a [tf.Model](https://js.tensorflow.org/api/latest/#class:Model) object. This model is [sequential](https://js.tensorflow.org/api/latest/#sequential) because its inputs flow straight down to its output. This might seem familiar if you have worked with the Python wrapper.
model.add(tf.layers.dense({inputShape: [1], units: 1, useBias: true}));
This adds an input layer to our network, which is automatically connected to a dense layer with one hidden unit. A dense layer is a type of layer that multiplies its inputs by a matrix (called weights) and then adds a number (called the bias) to the result. As this is the first layer of the network, we need to define our inputShape. The inputShape is [1] because we have 1 number as our input (the horsepower of a given car). units sets how big the weight matrix will be in the layer. By setting it to 1 here we are saying there will be 1 weight for each of the input features of the data.
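The multiply-by-weights-then-add-bias computation a dense layer performs can be sketched in a few lines of plain Python (illustrative only; TF.js does this with tensor math):

```python
def dense(inputs, weights, bias):
    """A dense layer: weighted sum of the inputs plus a bias, per unit."""
    return [sum(w * x for w, x in zip(unit_weights, inputs)) + b
            for unit_weights, b in zip(weights, bias)]

# One input feature (horsepower), one unit: y = w * x + b
print(dense([120.0], [[0.5]], [2.0]))  # → [62.0]
```

Training is then just the search for the values of `weights` and `bias` that make the outputs match the labels.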
Now we need to create an instance of the model, so we will add this line of code to the run function we made earlier
const model = createModel(); tfvis.show.modelSummary({name: 'Model Summary'}, model);
Now that we have the model, we need to prepare our data to feed into the neural network.
To get the performance benefits of TensorFlow.js that make training machine learning models practical, we need to convert our data to tensors. We will also perform a number of transformations on our data that are best practices, namely shuffling and **normalization**. Add this to the code:
/**
* Convert the input data to tensors that we can use for machine
* learning. We will also do the important best practices of _shuffling_
* the data and _normalizing_ the data
* MPG on the yaxis.
*/
function convertToTensor(data) {
// Wrapping these calculations in a tidy will dispose any
// intermediate tensors.
return tf.tidy(() => {
// Step 1. Shuffle the data
tf.util.shuffle(data);
// Step 2. Convert data to Tensor
const inputs = data.map(d => d.horsepower)
const labels = data.map(d => d.mpg);
const inputTensor = tf.tensor2d(inputs, [inputs.length, 1]);
const labelTensor = tf.tensor2d(labels, [labels.length, 1]);
//Step 3. Normalize the data to the range 0  1 using minmax scaling
const inputMax = inputTensor.max();
const inputMin = inputTensor.min();
const labelMax = labelTensor.max();
const labelMin = labelTensor.min();
const normalizedInputs = inputTensor.sub(inputMin).div(inputMax.sub(inputMin));
const normalizedLabels = labelTensor.sub(labelMin).div(labelMax.sub(labelMin));
return {
inputs: normalizedInputs,
labels: normalizedLabels,
// Return the min/max bounds so we can use them later.
inputMax,
inputMin,
labelMax,
labelMin,
}
});
}
Let's break this down.
tf.util.shuffle(data);
Here we randomize the order of the examples we will feed to the training algorithm. Shuffling is important because typically during training the dataset is broken up into smaller subsets, called batches, that the model is trained on. This also helps us remove sequence-based biases.
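For intuition, here is the classic Fisher-Yates algorithm for shuffling an array in place, sketched in plain Python (tf.util.shuffle handles this for you in TF.js; this toy version is only to show the idea):

```python
import random

def shuffle(items, seed=None):
    """In-place Fisher-Yates shuffle: each permutation is equally likely."""
    rng = random.Random(seed)
    for i in range(len(items) - 1, 0, -1):
        j = rng.randint(0, i)  # pick a partner from the not-yet-fixed prefix
        items[i], items[j] = items[j], items[i]

data = list(range(10))
shuffle(data, seed=0)
print(sorted(data) == list(range(10)))  # → True (same items, new order)
```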
const inputs = data.map(d => d.horsepower)
const labels = data.map(d => d.mpg);
const inputTensor = tf.tensor2d(inputs, [inputs.length, 1]);
const labelTensor = tf.tensor2d(labels, [labels.length, 1]);
Here we first make 2 arrays, an inputs array and a labels array, and then convert them into tensors.
const inputMax = inputTensor.max();
const inputMin = inputTensor.min();
const labelMax = labelTensor.max();
const labelMin = labelTensor.min();
const normalizedInputs = inputTensor.sub(inputMin).div(inputMax.sub(inputMin));
const normalizedLabels = labelTensor.sub(labelMin).div(labelMax.sub(labelMin));
Here we normalize the data into the numerical range 0-1 using *min-max scaling*. Normalization is important because the internals of many machine learning models you will build with TensorFlow.js are designed to work with numbers that are not too big. Common ranges to normalize data to include 0 to 1 or -1 to 1.
return {
inputs: normalizedInputs,
labels: normalizedLabels,
// Return the min/max bounds so we can use them later.
inputMax,
inputMin,
labelMax,
labelMin,
}
Now we will finally return the data and the normalization bounds.
This is not the end; we will explore the next steps in a continuation blog. For now, we first saw the capabilities of TF.js, then made some cool visualizations, after which we made a super simple neural network. Then we saw how we can preprocess the data. Next up, we will see how we can train our net and make predictions.
Read the next blog here.
All the code used here is available in my GitHub repository here.
This is the fourth part of the series where I post about TensorFlow for Deep Learning and Machine Learning. In the earlier blog post, you saw a Convolutional Neural Network for Computer Vision. It did the job pretty nicely. This time you're going to improve your skills with some real-life data sets and apply the discussed algorithms to them too. I believe in hands-on coding, so we will have many exercises and demos which you can try yourself too. I would recommend you play around with these exercises, change the hyperparameters and experiment with the code. If you have not read the previous article, consider reading it once before you read this one, here. This one is more like a continuation of that.
You can find the notebook used by me here. Again, you can download the notebook if you are using a local environment and, if you are using Colab, you can click on the open in Colab button.
This is a really nice way to improve our image recognition performance. Let's now look at it in action using a notebook. Here's the same neural network that you used before for loading the set of images of clothing and then classifying them. By the end of epoch five, you can see the loss is around 0.34, meaning your accuracy is pretty good on the training data.
Output with DNN
It took just a few seconds to train, so that's not bad. With the test data, as before and as expected, the loss is a little higher and thus the accuracy is a little lower.
So now, you can see the code that adds convolutions and pooling. We're going to do 2 convolutional layers, each with 64 convolutions, and each followed by a max pooling layer.
You can see that we defined our convolutions to be three-by-three and our pools to be two-by-two. Let's train. The first thing you'll notice is that the training is much slower. For every image, 64 convolutions are being tried, and then the image is compressed, and then another 64 convolutions, and then it's compressed again, and then it's passed through the DNN, and that's happening for 60,000 images on each epoch. So it might take a few minutes instead of a few seconds. To remedy this, you can use a GPU. How to do that in Colab?
All you need to do is Runtime > Change Runtime Type > GPU. A single epoch would now take approximately 5-6 seconds.
Output with the Convolutions and max poolings
Now that it's done, you can see that the loss has improved a little; it's 0.25 now. In this case, it's brought our accuracy up a bit for both our test data and our training data. That's pretty cool, right?
Now, this is a really fun visualization of the journey of an image through the convolutions. First, I'll print out the first 100 test labels. The number 9, as we saw earlier, is a shoe or boot. I picked out a few instances of this: the 0th, the 23rd and the 28th labels are all nine. So let's take a look at their journey.
The visualization
The Keras API gives us each convolution, each pooling, each dense layer, etc. as a layer. So with the layers API, I can take a look at each layer's outputs, and I'll create a list of each layer's output. I can then treat each item in the list as an individual activation model if I want to see the results of just that layer. Now, by looping through the layers, I can display the journey of the image through the first convolution, then the first pooling, then the second convolution, and then the second pooling. Note how the size of the image changes by looking at the axes. If I set the convolution number to one, we can see that it almost immediately detects the laces area as a common feature between the shoes.
So, for example, if I change the third image to be one, which looks like a handbag, you'll see that it also has a bright line near the bottom that could look like the sole of a shoe, but by the time it gets through the convolutions, that's lost, and the area for the laces doesn't even show up at all. So this convolution definitely helps me separate a shoe from a handbag. Again, if I set it to two, it appears to be trousers, but the feature that detected something the shoes had in common fails again. Also, if I change my third image back to the shoe but try a different convolution number, you'll see that for convolution two, it didn't really find any common features. To see commonality in a different image, try images two, three, and five. These all appear to be trousers. Convolutions two and four seem to detect this vertical feature as something they all have in common. If I again go to the list and find three labels that are the same, in this case six, I can see what they signify. When I run it, I can see that they appear to be shirts. Convolution four doesn't do a whole lot, so let's try five. We can kind of see that the color appears to light up in this case.
There are some exercises at the bottom of the notebook; check them out.
We will create a little pooling algorithm, so you can visualize its impact. There's a notebook that you can play with too, and I'll step through that here; here's the notebook for playing with convolutions. It uses a few Python libraries that you may not be familiar with, such as cv2. It also has Matplotlib, which we used before. If you haven't used them, they're really quite intuitive for this task and very easy to learn. So first, we'll set up our inputs and, in particular, import the misc library from SciPy. Now, this is a nice shortcut for us because misc.ascent returns a nice image that we can play with, and we don't have to worry about managing our own.
Matplotlib contains the code for drawing an image, and it will render it right in the browser with Colab. Here, we can see the ascent image from SciPy. Next up, we'll take a copy of the image, apply our homemade convolutions to it, and create variables to keep track of the x and y dimensions of the image. So we can see here that it's a 512 by 512 image. So now, let's create a convolution as a three by three array. We'll load it with values that are pretty good for detecting sharp edges first. Here's where we'll create the convolution.
We then iterate over the image, leaving a one-pixel margin. You'll see that the loop starts at one, not zero, and it ends at size_x minus one and size_y minus one. In the loop, it calculates the convolution value by looking at the pixel and its neighbors, multiplying them by the values determined by the filter, and finally summing it all up.
Vertical line filter
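The loop described above can be sketched in plain Python; here is a tiny version with a hypothetical 4x4 image (left half dark, right half bright) and a vertical-edge kernel:

```python
def convolve(image, kernel):
    """3x3 convolution over a 2-D image, leaving a one-pixel margin at zero."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0.0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    # multiply each neighbor by the matching kernel value
                    acc += image[y + dy][x + dx] * kernel[dy + 1][dx + 1]
            out[y][x] = acc
    return out

# A vertical-edge kernel responds where values change from left to right.
vertical = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
image = [[0, 0, 10, 10] for _ in range(4)]
result = convolve(image, vertical)
print(result[1])  # → [0.0, 40.0, 40.0, 0.0]
```

The strong responses land exactly on the dark-to-bright boundary, which is what "only certain features made it through the filter" means in practice.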
Let's run it. It takes just a few seconds, and when it's done, let's draw the results. We can see that only certain features made it through the filter. I've provided a couple more filters, so let's try them. This first one is really great at spotting vertical lines. So when I run it and plot the results, we can see that the vertical lines in the image made it through. It's really cool because they're not just straight up and down; they are vertical within the perspective of the image itself. Similarly, this filter works well for horizontal lines. So when I run it and then plot the results, we can see that a lot of the horizontal lines made it through. Now, let's take a look at pooling, and in this case max pooling, which takes pixels in chunks of four and only passes through the biggest value. I run the code and then render the output. We can see that the features of the image are maintained, but look closely at the axes: the size has been halved from the 500s to the 250s.
With pooling
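Max pooling in chunks of four (2x2 blocks) can likewise be sketched in a few lines of plain Python (illustrative only; assumes even image dimensions):

```python
def max_pool(image):
    """2x2 max pooling: keep the largest of each 2x2 block, halving each axis."""
    return [[max(image[y][x], image[y][x + 1],
                 image[y + 1][x], image[y + 1][x + 1])
             for x in range(0, len(image[0]), 2)]
            for y in range(0, len(image), 2)]

image = [[1, 3, 2, 0],
         [4, 2, 1, 1],
         [0, 0, 5, 6],
         [0, 0, 7, 8]]
print(max_pool(image))  # → [[4, 2], [0, 8]]
```

A 4x4 input becomes 2x2, which is exactly why the axes in the plot shrink from the 500s to the 250s while the strongest features survive.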
Now you need to apply this to MNIST handwriting recognition; we will revisit that from the last blog post. You need to improve MNIST to 99.8% accuracy or more using only a single convolutional layer and a single MaxPooling2D. You should stop training once the accuracy goes above this amount. It should happen in fewer than 20 epochs, so it's OK to hard-code the number of epochs for training, but your training must end once it hits the above metric. If it doesn't, then you'll need to redesign your layers.
When 99.8% accuracy has been hit, you should print out the string "Reached 99.8% accuracy so cancelling training!". Yes, this is just optional (you can also print out something like "I'm getting bored and won't train any more" 🤣).
The question notebook is available here
Wonderful! 😃 You just coded a handwriting recognizer with 99.8% accuracy (that's good) in fewer than 20 epochs. Let's explore my solution for this.
def train_mnist_conv():
    # Please write your code only where you are indicated.
    # Please do not remove model fitting inline comments.
    # YOUR CODE STARTS HERE
    class myCallback(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs={}):
            if logs.get('acc') > 0.998:
                print("\nReached 99.8% accuracy so cancelling training!")
                self.model.stop_training = True
    # YOUR CODE ENDS HERE
    mnist = tf.keras.datasets.mnist
    (training_images, training_labels), (test_images, test_labels) = mnist.load_data()
    # YOUR CODE STARTS HERE
    callbacks = myCallback()
    training_images = training_images.reshape(60000, 28, 28, 1)
    test_images = test_images.reshape(10000, 28, 28, 1)
    training_images = training_images / 255.0
    test_images = test_images / 255.0
    # YOUR CODE ENDS HERE
    model = tf.keras.models.Sequential([
        # YOUR CODE STARTS HERE
        tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
        tf.keras.layers.MaxPooling2D(2, 2),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')
        # YOUR CODE ENDS HERE
    ])
    model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
    # model fitting
    history = model.fit(
        # YOUR CODE STARTS HERE
        training_images,
        training_labels,
        epochs=20,
        callbacks=[callbacks]
        # YOUR CODE ENDS HERE
    )
    # model fitting
    return history.epoch, history.history['acc'][-1]
The callback class (This is the simplest)
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if logs.get('acc') > 0.998:
            print("\nReached 99.8% accuracy so cancelling training!")
            self.model.stop_training = True
The main CNN code
training_images = training_images.reshape(60000, 28, 28, 1)
test_images = test_images.reshape(10000, 28, 28, 1)
training_images = training_images / 255.0
test_images = test_images / 255.0
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
So, all you had to do was play around with the code and get this done in just 7 epochs.
My output
The solution notebook is available here
I hope this was helpful for you and that you learned about visualizing CNNs and applying them to a real-life data set. You also created a handwriting recognizer all by yourself with a wonderful accuracy. That's pretty good!
Hi everyone, I am Rishit Dagli.
If you want to ask me questions, report a mistake, suggest improvements, or give feedback, you are free to do so by emailing me at
Ignited Minds Expo is an annual robotics and technology event featuring many ideas, demonstrations, coding sessions, lightning talks, exhibits, competitions, and anything more you could think of, put together by Science Kidz and Young Engineers Club in collaboration. I myself gave a keynote talk about Artificial Intelligence; I will share that experience with you too.
This year Ignited Minds Expo witnessed 5000+ developers, students and tech enthusiasts!
There was a kind of enthusiasm I could feel in the air as people got ready to showcase their prototypes or present their ideas. For many students this became a starting platform for learning to demonstrate an idea and convey it to people at a large scale.
The event also featured many professionals from all across India. It featured almost all of your wildest imaginations: robots playing soccer, golf, and bowling; AI models greeting you according to your emotions; drones racing and welcoming you into the hall; a virtual paintball game; AR and VR games; and lots more.
The best part, in my opinion, was that you don't just get to see these kinds of tech applications: the exhibitors are more than happy to share how they made them, what their motivation was, which technology they used, and even the source code of the applications. So you can actually get to know how something was built and explore more for yourself. Moreover, there were some wonderful games too. Most of the models (90%+) were working, realistic models, which meant you could see how they work, watch a demonstration, and visualize the conditions under which they might be useful.
The event also had some cool science experiments alongside the technology exhibits, so you could explore some science too! At the event you could see 10-year-olds making simple but novel robotic models, maybe on an easy-to-use platform like WeDo, and it is quite stunning to see that 10-year-olds can think about the problems around them and solve them using technology.
I felt that Science Kidz and Young Engineers Club (YEC) are open to the latest technologies like Artificial Intelligence, Virtual Reality, Augmented Reality, and IoT. The best part about projects made with these technologies is that you can understand them even if you are a layman or have never worked with them. To get more information, all you needed to do was ask the exhibitors, and they were ready to furnish you with the necessary details. I also observed that most of the exhibitors were ready to make updates and were open to recommendations too; you could discuss with them the pros and cons of a solution. Most of the exhibitors were students who were well versed in their respective fields. I myself had an interesting (and quite long 😂) conversation with one of the exhibitors about alternative solutions we could use.
The team
It was a wonderful and joyous experience to give a keynote talk about AI, which is what I am good at. Addressing so many people at once, when everyone is focused on you and what you are saying, I felt quite nervous while delivering the talk. I was constantly scanning people's faces and trying to understand their expressions. After about 20 seconds I stopped doing so, as it had made me more nervous, and I decided not to continue. And there I was some time later: I had delivered my talk well, and the audience seemed receptive to it too.
My talk was a keynote, so I had to make sure that people understood well what I was trying to convey. Let me share with you a few lines from the first draft of my talk which were not shortlisted in the end but were, in my opinion, very cool:
I have worked for quite some time with AI and feel that there are two kinds of people: those who don't know AI and say it will destroy humanity and take away jobs, and those who know AI and are wondering why their model is classifying a cat as a dog.
By Shubham Sah.
My talk
So, I first started with where AI could be used and how it could be used. Then I explained the difference between Artificial Intelligence, Machine Learning, and Deep Learning. I then showed some algorithms which could be used to solve anything from pretty elementary problems to hardcore industry problems. I then explained and showed a live demo of an AI algorithm which could do emotion and face recognition on the edge. I had made an Android app to do so, ran the app in front of the audience, and showed them how accurate the model was.
When I was showing this model I ran into a small issue. I was running the app on my mobile device and live-streaming its screen via Wi-Fi to the laptop connected to the screen behind me. It just so happened that the computer lost the connection to my mobile screen. I was shocked and scared for a second; there was a retry button on the screen. That was my last hope. And guess what, the computer operator pressed the retry button and it worked! And I continued merrily with my demo. That was a scary moment for me! Fortunately, things worked out.
Another line from my talk which I personally love
A good reason to learn AI is to chat in AI language. Chat in AI language? OK, what is OMG? Not oh my God, but oh my gradients. What is BRB? Not be right back, but back propagation right back 😂
Wonderful feedback!
I learnt about presenting projects
Show your projects in a favorable manner
Got some new ideas too
Saw some working models of wonderful ideas
So, I would say overall it was a huge success and I loved my time at the event.
You have probably seen the image below; it's quite popular, by the way. And you too might have faced issues like Android Studio being very slow, builds taking a lot of time, or the AVD taking ages to start if you are on a slow machine like mine. So, what we will see now is how to set up a fully functional Android development environment on the cloud, and you can too! I will be showing you this with Google Cloud Platform (GCP). You can use the free credits provided by GCP to follow along. When we replicated this system we observed mind-blowing performance 🤯, with the first Gradle build happening in just over a second and Flutter builds in 3 seconds or so! Now that is great speed. You can imagine this as a 15 GB machine that runs only Android Studio.
Source: devrant.com
What you simply might want to do when someone tells you to make an Android development environment on the cloud is
Spin up a VM
Install some desktop environment in it and connect to your VM
Install Android studio
And get running. But alas, this does not work 😞; you will get a recommendation like this while creating the AVD, and the AVD won't start.
Error with the simple approach
HAXM does not support nested virtual machines
So GCP, or for that matter any other cloud provider, will not give you the ability to create nested virtual machines by default. Because it is blocked by default, Android Studio would work, and you would be able to write code and perform builds, but you would not be able to run an AVD. That is not very useful, so let's see how we can solve this problem.
Compute Engine VMs run on top of physical hardware (the host server), which is referred to as the L0 environment. Within the host server, a preinstalled hypervisor allows a single server to host multiple Compute Engine VMs, which are referred to as L1 or native VMs on Compute Engine. When you use nested virtualization, you install another hypervisor on top of the L1 guest OS and create nested VMs, referred to as L2 VMs, using the L1 hypervisor. L1 or native Compute Engine VMs that are running a guest hypervisor and nested VMs can also be referred to as host VMs.
Nested virtualization
In GCP
Nested virtualization can only be enabled for L1 VMs running on Haswell processors or later. If the default processor for a zone is Sandy Bridge or Ivy Bridge, you can use minimum CPU selection to choose Haswell or later for a particular instance.
Nested virtualization is supported only for KVM-based hypervisors running on Linux instances. Windows VMs do not support nested virtualization; that is, a host VM running Windows cannot run nested VMs.
Apart from setting this up on a Linux VM and getting the Linux feel, we will also see how to set up a Windows VM, as Windows VMs require almost no additional setup to get started, whereas a Linux instance might take you some time. Also, Linux usually uses VNC, whereas Windows has RDP for remote connections, and it is complicated to have VNC and RDP running on Linux at the same time.
RDP is undoubtedly better than VNC in performance. But again, RDP usually refers to Windows, while VNC is universal. There are RDP clients for Linux and Mac, so there's no need to worry. We will see both of these approaches.
For more details of how nested virtualization works and what restrictions exist for nested virtualization, see Enabling nested virtualization for VM instances.
These are the OSes that support nested VM in GCP but still, some setup would be required.
OSes which allow nested VM
We expect you to have gcloud installed on your system; it is the primary CLI tool for creating and managing Google Cloud resources. Follow this link to install gcloud.
To make a VM you first, of course, need to create a boot disk. So let's do this with gcloud:
gcloud compute disks create disk1 \
    --image-project debian-cloud \
    --image-family debian-9 \
    --size 100 \
    --zone us-central1-b
You are here making a boot disk called disk1 with a Debian image; change it to Windows Server 2016 if you want to make a Windows Server instance. The --size argument expects the size of the disk in GB. 100 GB is good enough for me; feel free to change the value according to your requirements. So for a Windows VM, use
gcloud compute disks create disk1 \
    --image-family windows-2016 \
    --image-project gce-uefi-images \
    --size 100 \
    --zone us-central1-b
gcloud compute images create [name_of_image] \
    --source-disk disk1 \
    --source-disk-zone us-central1-b \
    --licenses "https://compute.googleapis.com/compute/v1/projects/vm-options/global/licenses/enable-vmx"
Replace [name_of_image] with a name you like. What you are doing here is essentially making an image from the boot disk we created and applying the license necessary to allow it to perform nested virtualization.
After you create the image with the necessary license, you can delete the source disk if you no longer need it.
Create a VM instance using the new custom image with the license created in Step 2. You must create the instance in a zone that supports the Haswell CPU Platform or newer.
gcloud compute instances create [name_of_vm] \
    --zone us-central1-b \
    --min-cpu-platform "Intel Haswell" \
    --image [name_of_image] \
    --custom-cpu 4 \
    --custom-memory 15
Here you are simply creating a VM instance with the image you just created in Step 2.
You can edit --custom-cpu and --custom-memory; these set the number of virtual CPUs and the RAM you need. This configuration is good enough to run Android Studio at the speed mentioned at the start of the blog, but feel free to change it.
If you made a Windows instance, you are done 😄; you will now just have to test it out, so head on to the section on connecting to the VM. This is because GCP sets up a connection stream and a good desktop GUI, which Android Studio needs, only for Windows instances.
What you now need to do is create a desktop environment and make a connection channel so you can connect to your VM with that desktop environment.
You can use gnome, xfce, etc. There are many options, but for this post I will be using lxde, which takes minimal time to download. SSH into your VM instance and enter
sudo apt-get install lxde
You will be asked about some options; feel free to customize them, but if you simply want to follow along, just select Yes.
sudo apt-get install tightvncserver
This command installs tightvncserver, which will help us establish a VNC connection.
Go to VPC network > Firewall rules, or alternatively use this link and log in with the same GCP account. We need to do this because we want to allow vncserver to access port 5901. Choose a name and target tag for it, and set the allowed protocols to tcp:5901. As an example, here is the firewall rule I made: it was named vncserver and had the target tag vncserver.
A sample firewall rule
Now go to the VM instances page > select the VM you are using > Edit > Networking. Add the target tag in the network tags textbox. You might need to stop the VM to do this.
Navigate to ~/.vnc/xstartup and open it in your favorite text editor like vim, nano, etc. Add /usr/bin/startlxde at its end. This will tell VNC to render the lxde desktop at startup.
You just made a desktop environment on your Linux VM 😃
SSH into your VM instance and type
vncserver
This starts a vncserver session. Use a VNC viewer like RealVNC or TightVNC Viewer. Now you will connect to your VM.
A VNC client
Enter the external IP of your VM followed by :5901 to tell it to view port 5901, and you are done connecting to your VM instance.
You first need to make a firewall rule to allow RDP in your instance. Go to VPC network > Firewall. Click Create a firewall rule, add tcp:3389 in the protocols and ports section, and give a name and target tag to the firewall rule. Save it and navigate back to the VM instances page. In the VM instances page > select the VM you are using > Edit > Networking, add the target tag you just entered while creating the firewall rule.
Your local OS:
Windows: you are ready to connect without any more setup. Open the preinstalled Remote Desktop app.
Any other OS: install an RDP client.
For Mac: download the Microsoft Remote Desktop client from the Mac App Store, or use CoRD.
You are now ready to connect to your VM. In the GCP console, click Set Windows password, make a new account, and save the password somewhere you will remember.
Now enter the external IP of your VM in your RDP client. It will ask you for a username and password; enter the credentials you just created, and you are connected to your instance!
Download Android Studio from the official website from within the VM. If you're on a Windows VM, consider installing Chrome first (no Internet Explorer 😆).
Windows: you are ready to run the AVD and start development 😃
Linux: you might get a /dev/kvm permission denied error.
Install qemu-kvm:
sudo apt install qemu-kvm
Then use the adduser command to add your user to the kvm group:
sudo adduser <username> kvm
And now you are done: you can run an AVD here and do your development with mind-blowing performance. I would recommend changing the heap size of your AVD to at least 512 MB for good performance.
The AVD runs!
You're ready to start building with Android Studio. Here's an indication of the first Gradle build happening in just over a second when I used this setup:
The first Gradle build in a little over a second!
If graphics aren't rendered on the AVD when you are using the GPU, edit your AVD settings to switch the Emulated Performance setting to Software. This setting lets you select how graphics are rendered in the emulator:
Hardware: use your computer's graphics card for faster rendering.
Software: emulate the graphics in software, which is useful if you're having a problem with rendering in your graphics card.
All the code used here is available in my GitHub repository here.
This is the third part of the series where I post about TensorFlow for Deep Learning and Machine Learning. In the earlier blog post you saw a basic neural network for computer vision. It did the job nicely, but it was a little naive in its approach. This time you're going to improve on that using Convolutional Neural Networks (CNNs). I believe in hands-on coding, so we will have many exercises and demos which you can try yourself too. I would recommend you play around with these exercises, change the hyperparameters, and experiment with the code. We will also be working with some real-life data sets and applying the discussed algorithms to them. If you have not read the previous article, consider reading it before this one.
I think it's really cool that you're already able to implement a neural network to do this fashion classification task. It's just amazing that large data sets like this are readily available, which makes it really easy to learn. And in this case, we saw that with just a few lines of code we were able to build a DNN, a deep neural network, that allowed you to do this classification of clothing, and we got reasonable accuracy with it. But it was a little bit of a naive algorithm that we used, right? We're looking at each and every pixel in every image, but maybe there are ways we can do better by looking at features of what makes a shoe a shoe and what makes a handbag a handbag. What do you think? You might think something like: if I have a shoelace in the picture, it could be a shoe, and if there is a handle, it may be a handbag.
So, one of the ideas that makes these neural networks work much better is to use convolutional neural networks, where instead of looking at every single pixel and saying, getting the pixel values and then figuring out, is this a shoe or is this a handbag? I don't know. Instead, you can look at a picture and say, OK, I see shoelaces and a sole; then it's probably a shoe. Or say, I see a handle and a rectangular bag beneath it; probably a handbag. What's really interesting about convolutions is that they sound very complicated, but they're actually quite straightforward. A convolution is essentially just a filter that you pass over an image, in the same way as if you're doing some sharpening. If you have ever done image processing, a filter can spot features within the image, just like we talked about. With the same paradigm of just data and labels, we can let a neural network figure out for itself that it should look for shoelaces and soles, or handles on bags, and then learn how to detect these things by itself. So we will also see how well or badly it works in comparison to our earlier approach on Fashion MNIST.
So, now we will learn about convolutional neural networks and get to use them to build a much better fashion classifier.
In the DNN approach, in just a couple of minutes you're able to train it to classify with pretty high accuracy on the training set, but a little less on the test set. Now, one of the things you would have seen when you looked at the images is that there's a lot of wasted space in each image. While there are only 784 pixels, it would be interesting to see if there is a way to condense the image down to the important features that distinguish what makes it a shoe, or a handbag, or a shirt. That's where convolutions come in. So, what is a convolution?
How do filters work?
If you have ever done any kind of image processing, it usually involves having a filter and passing that filter over the image in order to change the underlying image. The process works a little bit like this: for every pixel, take its value and look at the values of its neighbors. If our filter is 3 by 3, we look at the immediate neighbors, so that we have a corresponding 3 by 3 grid. Then, to get the new value for the pixel, we simply multiply each neighbor by the corresponding value in the filter. So, for example, in this case, our pixel has the value 192, and its upper-left neighbor has the value 0. The upper-left value in the filter is 1, so we multiply 0 by 1. Then we do the same for the upper neighbor: its value is 64 and the corresponding filter value is 0, so we multiply those out.
Repeat this for each neighbor and each corresponding filter value, and the new pixel value is the sum of all the neighbor values multiplied by the corresponding filter values. That's a convolution. It's really as simple as that. The idea here is that some convolutions will change the image in such a way that certain features in the image get emphasized.
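That arithmetic can be sketched in a few lines of NumPy (the neighborhood and filter values here are made up for illustration; only the centre value 192 comes from the example above):

```python
import numpy as np

# A 3x3 neighborhood around the current pixel (centre value 192),
# with illustrative values for the neighbors.
neighborhood = np.array([[0,  64, 128],
                         [48, 192, 144],
                         [42, 226, 168]])

# An illustrative 3x3 filter.
conv_filter = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]])

# Multiply each neighbor by the corresponding filter value and sum it all up:
# that single number is the convolution output for this pixel.
new_pixel = int(np.sum(neighborhood * conv_filter))
```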
So, for example, if you look at this filter, the vertical lines in the image really pop out. Don't worry, we will do a hands-on with this later.
A filter which pops out vertical lines
With this filter, the horizontal lines pop out.
A filter which pops out horizontal lines
Now, that's a very basic introduction to what convolutions do, and when combined with something called pooling, they become really powerful.
Now, what's pooling then? Pooling is a way of compressing an image. A quick and easy way to do this is to go over the image four pixels at a time: the current pixel and its neighbors underneath and to the right of it.
Max pooling on a 4 by 4 grid
Of these four, pick the biggest value and keep just that. So, for example, you can see it here: my 16 pixels on the left are turned into the 4 pixels on the right by looking at them in 2 by 2 grids and picking the biggest value. This preserves the features that were highlighted by the convolution, while simultaneously quartering the size of the image, as it is halved on both the horizontal and vertical axes.
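The 16-to-4 pixel example can be sketched with NumPy (the pixel values here are made up for illustration, not the ones from the figure):

```python
import numpy as np

# A 4x4 block of pixel values (illustrative).
pixels = np.array([[0,  64,   0, 128],
                   [48, 192, 144,  32],
                   [0,   0,  62, 126],
                   [8,  16, 255,   1]])

# 2x2 max pooling: split the image into 2x2 blocks and keep only the
# biggest value in each block, quartering the number of pixels.
pooled = pixels.reshape(2, 2, 2, 2).max(axis=(1, 3))
```

The `reshape` groups the rows and columns into 2 by 2 blocks, and the `max` over the within-block axes keeps one value per block, so a 4 by 4 input becomes a 2 by 2 output.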
These layers are available as Conv2D and MaxPooling2D in TensorFlow.
We don't have to do all the math for filtering and compressing ourselves; we simply define convolutional and pooling layers to do the job for us.
So here's our code from the earlier example, where we defined a neural network with an input layer in the shape of our data, an output layer in the shape of the number of categories we're trying to predict, and a hidden layer in the middle. The Flatten layer takes our square 28 by 28 images and turns them into a one-dimensional array.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
To add convolutions to this, you use code like this.
model = tf.keras.models.Sequential([
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu', input_shape=(28, 28, 1)),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Conv2D(64, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D(2, 2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])
You'll see that the last three lines are the same: the Flatten, the Dense hidden layer with 128 neurons, and the Dense output layer with 10 neurons. What's different is what has been added on top of this. Let's take a look at it, line by line.
tf.keras.layers.Conv2D(64, (3,3), activation='relu', input_shape=(28, 28, 1))
Here we're specifying the first convolution. We're asking keras to generate 64 filters for us. These filters are 3 by 3, their activation is relu, which means the negative values will be thrown away, and finally the input shape is as before, 28 by 28. That extra 1 just means that we are using a single byte for color depth. As we saw before, our images are grayscale, so we just use one byte. Now, of course, you might wonder what the 64 filters are. It's a little beyond the scope of this blog to define them, but for now you can understand that they are not random. They start with a set of known good filters, in a similar way to the pattern fitting you saw earlier, and the ones that work from that set are learned over time.
tf.keras.layers.MaxPooling2D(2, 2)
This next line of code creates a pooling layer. It's max pooling because we're going to take the maximum value. We're saying it's a two-by-two pool, so for every four pixels, only the biggest one will survive, as shown earlier. We then add another convolutional layer and another max-pooling layer, so that the network can learn another set of convolutions on top of the existing one, and then again pool to reduce the size. So, by the time the image gets to the Flatten layer and goes into the dense layers, it's already much smaller. It's been quartered, and then quartered again. So its content has been greatly simplified, the goal being that the convolutions will filter it down to the features that determine the output.
A really useful method on the model is model.summary. It allows you to inspect the layers of the model and see the journey of the image through the convolutions. Here is the output.
Output of model.summary
It's a nice table showing us the layers and some details about them, including the output shape. It's important to keep an eye on the output shape column. When you first look at this, it can be a little bit confusing and feel like a bug. After all, isn't the data 28 by 28, so why is the output 26 by 26? The key to this is remembering that the filter is a 3 by 3 filter. Consider what happens when you start scanning through an image starting at the top left. You can't calculate the filter for the pixel in the top left, because it doesn't have any neighbors above it or to its left. In a similar fashion, the next pixel to the right won't work either, because it doesn't have any neighbors above it. So, logically, the first pixel you can do calculations on is one row down and one column in, because that one has all 8 neighbors that a 3 by 3 filter needs. This, when you think about it, means that you can't use a 1-pixel margin all around the image, so the output of the convolution will be 2 pixels smaller on x and 2 pixels smaller on y. If your filter is 5 by 5, for similar reasons your output will be 4 smaller on x and 4 smaller on y. So, that's why, with a 3 by 3 filter, our output from the 28 by 28 image is now 26 by 26: we've removed that one pixel on x and y at each of the borders.
So, now our output gets reduced from 26 by 26 to 13 by 13. The next convolution operates on that, and of course we lose the 1-pixel margin as before, so we're down to 11 by 11; add another 2 by 2 max pooling, which rounds down, and we're down to 5 by 5 images. So now our dense neural network is the same as before, but it's being fed 5 by 5 images instead of 28 by 28 ones.
But remember, it's not just one compressed 5 by 5 image instead of the original 28 by 28; there are a number of convolutions per image, which we specified, in this case 64. So, there are 64 new 5 by 5 images being fed in. Flatten that out and you have 25 pixels times 64, which is 1,600. So, you can see that the new flattened layer has 1,600 elements in it, as opposed to the 784 you had previously. This number is impacted by the parameters you set when defining the convolutional layers. Later, when you experiment, you'll see what the impact of setting other values for the number of convolutions will be; in particular, you can see what happens when you're feeding in fewer than 784 pixels overall. Training should be faster, but is there a sweet spot where it's more accurate?
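The size bookkeeping in the last few paragraphs can be checked with a few lines of arithmetic (a sketch; the helper names are mine, not from the notebook):

```python
# Trace the image size through the layers described above:
# 28x28 input, two 3x3 convolutions, two 2x2 max poolings, 64 filters.
def conv_output(size, kernel=3):
    # A valid 3x3 convolution trims one pixel from each border.
    return size - (kernel - 1)

def pool_output(size, pool=2):
    # 2x2 max pooling halves the size, rounding down.
    return size // pool

size = 28
size = conv_output(size)   # 26 after the first Conv2D
size = pool_output(size)   # 13 after the first MaxPooling2D
size = conv_output(size)   # 11 after the second Conv2D
size = pool_output(size)   # 5 after the second MaxPooling2D

flattened = size * size * 64
print(size, flattened)  # 5 1600
```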
To learn more, implement this on a real-life data set, and even visualize the journey of an image through a CNN, please head on to the next blog post.
All the code used here is available at the GitHub repository here.
This is the second part of the series where I post about TensorFlow for Deep Learning and Machine Learning. In the earlier blog post, you learned all about how Machine Learning and Deep Learning is a new programming paradigm. This time you're going to take that to the next level by beginning to solve problems of computer vision with just a few lines of code! I believe in hands-on coding, so we will have many exercises and demos which you can try yourself too. I would recommend you play around with these exercises, change the hyperparameters, and experiment with the code. We will also be working with some real-life data sets and applying the discussed algorithms to them. If you have not read the previous article, consider reading it before this one.
So fitting straight lines seems like the Hello, world of learning algorithms, the most basic implementation. But one of the most amazing things about machine learning is that the core idea of fitting the x and y relationship is what lets us do amazing things, like have computers look at a picture and do activity recognition, or look at a picture and tell us: is this a dress, or a pair of pants, or a pair of shoes? Really hard for humans, and amazing that computers can now do these things as well. Computer vision is a really hard problem to solve, right? Because you're saying dress or shoes; how would I write rules for that? How would I say: if this pixel, then it's a shoe; if that pixel, then it's a dress? It's really hard to do, so labeled samples are the right way to go. One of the non-intuitive things about vision is that it's so easy for a person to look at you and say you're wearing a shirt, yet so hard for a computer to figure it out.
Because it's so easy for humans to recognize objects, it's almost difficult to understand why this is a complicated thing for a computer to do. What the computer has to do is look at all the numbers, all the pixel brightness values, and say: these numbers correspond to a black shirt. And it's amazing that, with machine learning and deep learning, computers are getting really good at this.
Computer vision is the field of having a computer understand and label what is present in an image. When you look at the image below, you can interpret what a shirt is or what a shoe is, but how would you program for that? If an extraterrestrial who had never seen clothing walked into the room with you, how would you explain shoes to them? It's really difficult, if not impossible, right? And it's the same problem with computer vision. So one way to solve it is to use lots of pictures of clothing, tell the computer what each is a picture of, and then have the computer figure out the patterns that give you the difference between a shoe, a shirt, a handbag, and a coat.
Image example
Fortunately, there's a data set called Fashion MNIST (not to be confused with the handwriting MNIST data set, which is your exercise) which contains 70,000 images spread across 10 different items of clothing. The images have been scaled down to 28 by 28 pixels. Usually, the smaller the better, because the computer has less processing to do, but of course you need to retain enough information to be sure the features of the object can still be distinguished. If you look at the image, you can still tell the difference between shirts, shoes, and handbags, so this size seems ideal, and it makes the set great for training a neural network. The images are also in grayscale, so the amount of information is further reduced: each pixel is a value from 0 to 255, so it takes only one byte per pixel. With 28 by 28 pixels per image, only 784 bytes are needed to store an entire image. Despite that, we can still see what's in the image below, and in this case, it's an ankle boot, right?
An image from the data set
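The storage arithmetic is easy to confirm with a quick NumPy sketch (this is just an illustration, not part of the dataset code):

```python
import numpy as np

# A 28x28 grayscale image with one byte (a value 0-255) per pixel
image = np.zeros((28, 28), dtype=np.uint8)

print(image.size)    # 784 pixels
print(image.nbytes)  # 784 bytes, since uint8 is exactly one byte per pixel
```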
You can learn more about the Fashion MNIST data set at the GitHub repository here. You can also download the data set from here.
So what will handling this look like in code? In the previous blog post, you learned about TensorFlow and Keras, and how to define a simple neural network with them. Here, you are going to use them to go a little deeper, but the overall API should look familiar. The one big difference is in the data. Last time you had just six pairs of numbers, so you could hard code them. This time you have to load 70,000 images off the disk, so there will be a bit of code to handle that. Fortunately, it's still quite simple because Fashion MNIST is available as a data set with an API call in TensorFlow.
fashion_mnist = keras.datasets.fashion_mnist
(train_images, train_labels), (test_images, test_labels) = fashion_mnist.load_data()
Consider the code fashion_mnist.load_data(). What we are doing here is getting the Fashion MNIST dataset object from Keras and calling its load_data() method, which returns four arrays: train_images, train_labels, test_images and test_labels. Now, what are these, you might wonder? When building a neural network like this, it's a nice strategy to use some of your data to train the network, and keep back similar data that the model hasn't yet seen to test how good it is at recognizing the images. So in the Fashion MNIST data set, 60,000 of the 70,000 images are used to train the network, and the remaining 10,000 images, ones it hasn't previously seen, are used to test how well or how badly the model performs. This code gives you those sets.
An image from the data set and the code used
So for example, the training data will contain images like this one, and a label that describes the image, like this. While this image is an ankle boot, the label describing it is the number nine. Why do you think that is? There are two main reasons. First, of course, computers do better with numbers than with text. Second, and importantly, this is something that can help us reduce bias. If we labeled it as "ankle boot", we would of course be biased towards English speakers. But with a numeric label, we can refer to it in whatever language is appropriate, be it English, Hindi, German, Mandarin, or here, even Irish. You can learn more about bias and techniques to avoid it here.
Okay. So now we will look at the code for the neural network definition. Remember, last time we had a Sequential model with just one layer in it. Now we have three layers.
model = keras.Sequential([
keras.layers.Flatten(input_shape = (28, 28)),
keras.layers.Dense(128, activation = tf.nn.relu),
keras.layers.Dense(10, activation = tf.nn.softmax)
])
The important things to look at are the first and the last layers. The last layer has 10 neurons in it because we have ten classes of clothing in the data set; they should always match. The first layer is a Flatten layer with an input shape of 28 by 28. If you remember, our images are 28 by 28, so we're specifying that this is the shape the data should come in. Flatten takes this 28 by 28 square and turns it into a simple linear array. The interesting stuff happens in the middle layer, sometimes also called a hidden layer. It has 128 neurons in it, and I'd like you to think about these as variables in a function; maybe call them x1, x2, x3, etc. Now, there exists a rule that incorporates all of these that turns the 784 values of an ankle boot into the value nine, and similarly for all of the other 70,000 images.
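What Flatten does can be mimicked with NumPy's reshape; this is just a conceptual sketch, not the actual Keras internals:

```python
import numpy as np

# A dummy 28x28 "image" of sequential pixel values
image = np.arange(28 * 28).reshape(28, 28)

# Flatten turns the 28x28 square into a simple linear array of 784 values
flat = image.reshape(-1)

print(flat.shape)  # (784,)
```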
A neural net layer
So, what the neural net does is figure out weights w0, w1, w2, ..., wn such that (x1 * w1) + (x2 * w2) + ... + (x128 * w128) = y. You'll see that it's doing something very, very similar to what we did earlier when we figured out y = 2x - 1. In that case, the two was the weight of x; here I'm saying y = (w1 * x1) + (w2 * x2) + ..., etc. The important thing now is to get the code working, so you can see a classification scenario for yourself. You can also tune the neural network by adding, removing, and changing layer sizes to see the impact.
As we discussed earlier, to finish this example and write the complete code we will use TensorFlow 2.x. Before that, we will explore a few Google Colaboratory tips, as that is what you might be using. However, you can also use Jupyter Notebooks, preferably in your local environment.
On Colab notebooks you can access your Google Drive as a network mapped drive in the Colab VM runtime.
# You can access your Drive files using the path "/content/gdrive/My Drive/"
from google.colab import drive
drive.mount('/content/gdrive')
You can also sync a Google Drive folder to your computer. Combined with the previous tip, your local files will then be available in your Colab notebook.
There are some resources from Google explaining that having a lot of files in your root folder can slow down the process of mapping the drive. If you have a lot of files in your root folder on Drive, create a new folder and move them all there. Having a lot of folders in the root folder will presumably have a similar impact.
For some applications, you might need a hardware accelerator like a GPU or a TPU. Say you are building a CNN: one epoch might take 90-100 seconds on a CPU but just 5-6 seconds on a GPU, and milliseconds on a TPU. So this is definitely helpful. You can go to
Runtime > Change Runtime Type > Select your hardware accelerator
There is also a feature called power level. Power level is an April Fools' joke feature that adds sparks and combos to cell editing. Access it using
Tools > Settings > Miscellaneous > Select Power
Example of high power
The notebook is available here. If you are using a local development environment, download this notebook; if you are using Colab, click the Open in Colab button. We will also see some exercises in this notebook.
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf
First, we use the above code to import TensorFlow 2.x. If you are using a local development environment you do not need lines 1-5, just the import.
Then, as discussed we use this code to get the data set.
fashion_mnist = keras.datasets.fashion_mnist
(training_images, training_labels), (test_images, test_labels) = fashion_mnist.load_data()
We will now use matplotlib
to view a sample image from the dataset.
import matplotlib.pyplot as plt
plt.imshow(training_images[0])
You can change the 0 to other values to get other images, as you might have guessed. You'll notice that all of the values in the image are between 0 and 255. If we are training a neural network, for various reasons it's easier if we treat all values as between 0 and 1, a process called normalizing. Fortunately, in Python it's easy to normalize a list like this without looping. You do it like this:
training_images = training_images / 255.0
test_images = test_images / 255.0
Now, the next code block in the notebook defines the same neural net we discussed earlier. You might have noticed one change: we use the softmax function. Softmax takes a set of values and effectively picks the biggest one. So, for example, if the output of the last layer looks like [0.1, 0.1, 0.05, 0.1, 9.5, 0.1, 0.05, 0.05, 0.05], it saves you from fishing through it looking for the biggest value and (after rounding) turns it into [0, 0, 0, 0, 1, 0, 0, 0, 0]. The goal is to save a lot of coding!
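Strictly speaking, softmax turns the scores into probabilities that sum to 1, with the biggest score getting a probability close to 1. A NumPy sketch of the idea (not the TensorFlow implementation):

```python
import numpy as np

def softmax(v):
    e = np.exp(v - np.max(v))  # subtract the max for numerical stability
    return e / e.sum()

out = np.array([0.1, 0.1, 0.05, 0.1, 9.5, 0.1, 0.05, 0.05, 0.05])
probs = softmax(out)

print(np.argmax(probs))  # 4, the index of the biggest value
print(probs.sum())       # 1.0, it is a probability distribution
```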
We can then try to fit the training images to the training labels. We'll just do it for five epochs to be quick. We spend about 50 seconds training and end up with a loss of about 0.205, which means it's pretty accurate in guessing the relationship between the images and their labels. That's not great, but considering it was done in just 50 seconds with a very basic neural network, it's not bad either. A better measure of performance comes from trying the test data: images the network has not yet seen. You would expect performance to be somewhat worse, but if it's much worse, you have a problem. As you can see, the loss is about 0.32, meaning the model is a little less accurate on the test set. That's not great either, but we know we're doing something right.
I have some questions and exercises for you, 8 in all, and I recommend you go through all of them. You will also explore the same example with more neurons, and things like that. So have fun coding!
You just made a complete fashion MNIST algorithm that can predict with pretty good accuracy the images of fashion items.
How can I stop training when I reach the point I want to be at? Do I always have to hard code a certain number of epochs? Well, in every epoch you can call back to a code function, having checked the metrics, and if they are what you want, you can cancel the training at that point.
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if logs.get('loss') < 0.7:
            print("\nLow loss so cancelling the training")
            self.model.stop_training = True
It's implemented as a separate class, but it can be inline with your other code; it doesn't need to be in a separate file. In it, we implement the on_epoch_end function, which gets called by the callback mechanism whenever an epoch ends. It also receives a logs object containing lots of great information about the current state of training. For example, the current loss is available in the logs, so we can check it against a threshold; here I'm checking if the loss is less than 0.7 and cancelling the training. Now that we have our callback, let's return to the rest of the code. There are two modifications we need to make. First, we instantiate the class we just created; we do that with this code.
callbacks = myCallback()
Then, in model.fit, we use the callbacks parameter and pass it this instance of the class:
model.fit(training_images, training_labels, epochs = 10, callbacks = [callbacks])
This passes the callback object to the callbacks argument of model.fit().
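The control flow of the callback can be simulated without TensorFlow at all; a toy sketch with made-up per-epoch losses:

```python
# Made-up losses for five epochs, just to illustrate the early stop
losses = [2.1, 1.4, 0.9, 0.65, 0.5]

trained_epochs = 0
for loss in losses:
    trained_epochs += 1
    # the same check as in on_epoch_end above
    if loss < 0.7:
        print("Low loss so cancelling the training")
        break

print(trained_epochs)  # 4: training stops early instead of running all 5
```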
Use this notebook to explore more and see this code in action here. This notebook contains all the modifications we talked about.
You learned how to do classification using Fashion MNIST, a data set containing items of clothing. There's another, similar dataset called MNIST which has images of handwritten digits 0 through 9.
Write an MNIST classifier that trains to 99% accuracy or above, and does it without a fixed number of epochs i.e. you should stop training once you reach that level of accuracy.
Some notes:
It should succeed in fewer than 10 epochs, so it is okay to set epochs to 10, but nothing larger
When it reaches 99% or greater it should print out the string "Reached 99% accuracy so cancelling training!"
Use this code line to get the MNIST handwriting data set:
mnist = tf.keras.datasets.mnist
Here's a Colab notebook with the question and some starter code already written here
Wonderful! 😃 You just coded a handwriting recognizer with 99% accuracy (that's good) in fewer than 10 epochs. Let's explore my solution.
def train_mnist():
    # Please write your code only where you are indicated.
    # Please do not remove the "# model fitting" inline comments.

    # YOUR CODE SHOULD START HERE
    class myCallback(tf.keras.callbacks.Callback):
        def on_epoch_end(self, epoch, logs={}):
            if logs.get('accuracy') > 0.99:
                print("\nReached 99% accuracy so cancelling training!")
                self.model.stop_training = True
    # YOUR CODE SHOULD END HERE

    mnist = tf.keras.datasets.mnist
    (x_train, y_train), (x_test, y_test) = mnist.load_data()

    # YOUR CODE SHOULD START HERE
    x_train, x_test = x_train / 255.0, x_test / 255.0
    callbacks = myCallback()
    # YOUR CODE SHOULD END HERE

    model = tf.keras.models.Sequential([
        # YOUR CODE SHOULD START HERE
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(512, activation=tf.nn.relu),
        tf.keras.layers.Dense(10, activation=tf.nn.softmax)
        # YOUR CODE SHOULD END HERE
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    # model fitting
    history = model.fit(  # YOUR CODE SHOULD START HERE
        x_train,
        y_train,
        epochs=10,
        callbacks=[callbacks]
        # YOUR CODE SHOULD END HERE
    )
    # model fitting
    return history.epoch, history.history['accuracy'][-1]
I will just go through the important parts.
The callback class
class myCallback(tf.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs={}):
        if logs.get('accuracy') > 0.99:
            print("\nReached 99% accuracy so cancelling training!")
            self.model.stop_training = True
The main neural net code
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(512, activation=tf.nn.relu),
    tf.keras.layers.Dense(10, activation=tf.nn.softmax)
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
history = model.fit(x_train, y_train, epochs=10, callbacks=[callbacks])
So, all you had to do was play around with the code and this gets done in just 5 epochs.
The solution notebook is available here.
Previous Blog in this series here
Hi everyone I am Rishit Dagli
If you want to ask me some questions, report any mistake, suggest improvements, give feedback you are free to do so by emailing me at
All the code used here is available at the GitHub repository here.
This is the first part of a series where I will be posting many blog posts about coding machine learning and deep learning algorithms. I believe in hands-on coding, so we will have many exercises and demos which you can try yourself too. I recommend you play around with these exercises, change the hyperparameters, and experiment with the code; that is the best way to learn to code efficiently. If you don't know what hyperparameters are, don't worry. We will walk through TensorFlow's core capabilities and then also explore high-level APIs in TensorFlow, and so on.
If you are looking to get started with TensorFlow, machine learning and deep learning but don't have any knowledge about machine learning, you can follow along. If you already know about machine learning and deep learning but not TensorFlow, you can follow these too. If you use other frameworks or libraries like Caffe, Theano, CNTK, Apple ML or PyTorch and already know the maths behind them, you can skip the sections marked OPTIONAL. If you know the basics of TensorFlow, you can skip some of the blog posts.
No!!
Not at all. You don't need to install any software to complete the exercises and follow the code. We will be using Google Colaboratory. Colaboratory is a research tool for machine learning education and research: a Jupyter notebook environment that requires no setup to use. Colaboratory works with most major browsers and is most thoroughly tested with the latest versions of Chrome, Firefox, and Safari. What's more, it is completely free to use. All Colaboratory notebooks are stored in Google Drive, and they can be shared just as you would share Google Docs or Sheets: simply click the Share button at the top right of any Colaboratory notebook, or follow these Google Drive file sharing instructions. And before you ask, Colaboratory is really reliable and fast too.
However, you can also use a local development environment, preferably Jupyter Notebooks, but we will not be covering how to set up the local environment in these blogs.
The answer to this question is that in this blog we will cover both. If you are using Colab notebooks add this code.
try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
except Exception:
    pass
import tensorflow as tf
Coding has been the bread and butter for developers since the dawn of computing. We're used to creating applications by breaking down requirements into composable problems that can then be coded against. For example, if we have to write an application that figures out a stock analytic, maybe a price ratio, we can usually write code to get the values from a data source, do the calculation and return the result. Or if we're writing a game, we can usually figure out the rules: if the ball hits the brick then the brick should vanish and the ball should rebound, but if the ball falls off the bottom of the screen then maybe the player loses a life.
We can represent that with this diagram: rules and data go in, answers come out. Rules are expressed in a programming language, and data can come from a variety of sources, from local variables up to databases. Machine learning rearranges this diagram: we put answers in and data in, and we get rules out. So instead of us as developers figuring out the rules (when should the brick be removed, when should the player's life end, or what's the desired analytic for any other concept), we get a bunch of examples of what we want to see and have the computer figure out the rules.
Traditional Programming vs Machine Learning
So consider this example: activity recognition. If I'm building a device that detects if somebody is, say, walking, and I have data about their speed, I might write code like this. If they're running, well, that's a faster speed, so I could adapt my code, and if they're biking, that's not too bad either; I can adapt my code again. But then I have to do golf recognition too, and now my concept becomes broken. And not only that: doing it by speed alone is of course quite naive. We walk and run at different speeds uphill and downhill, and other people walk and run at different speeds than we do.
A traditional programming logic
The new paradigm is that I get lots and lots of examples, I have labels on those examples, and I use the data to say: this is what walking looks like, this is what running looks like, this is what biking looks like, and yes, even this is what golfing looks like. So then it becomes answers and data in, with rules being inferred by the machine. A machine learning algorithm then figures out the specific patterns in each set of data that determine the distinctiveness of each. That's what's so powerful and exciting about this programming paradigm. It's more than just a new way of doing the same old thing; it opens up new possibilities that were infeasible before.
So, now I am going to show you the basics of creating a neural network for doing this type of pattern recognition. A neural network is just a slightly more advanced implementation of machine learning and we call that deep learning. But fortunately it's actually very easy to code. So, we're just going to jump straight into deep learning. We'll start with a simple one and then we'll move on to one that does computer vision in about 10 lines of code.
To show how that works, lets take a look at a set of numbers and see if you can determine the pattern between them. Okay, here are the numbers.
x = -1, 0, 1, 2, 3, 4
y = -3, -1, 1, 3, 5, 7
You easily figure this out: y = 2x - 1
You probably tried that out with a couple of other values and saw that it fits. Congratulations, you've just done the basics of machine learning in your head.
Okay, here's our first line of code. This is written using Python and TensorFlow, and an API in TensorFlow called Keras. Keras makes it really easy to define neural networks. A neural network is basically a set of functions which can learn patterns.
model = keras.Sequential([keras.layers.Dense(units = 1, input_shape = [1])])
The simplest possible neural network is one that has only one neuron in it, and that's what this line of code does. In Keras we use the word Dense to define a layer of connected neurons. There is only one Dense here, meaning there is only one layer, and it has a single unit, so there is only one neuron. Successive layers in Keras are defined in sequence, hence the word Sequential. You define the shape of what's input to the neural network in the first (and in this case only) layer, and you can see that our input shape is super simple: it's just one value. You've probably heard that for machine learning you need to know and use a lot of maths, calculus, probability and the like. It's really good to understand that as you start optimizing your models, but the nice thing for now about TensorFlow and Keras is that a lot of that maths is implemented for you in functions. There are two function roles that you should be aware of though, and these are loss functions and optimizers.
model.compile(optimizer = 'sgd', loss = 'mean_squared_error')
This line defines them for us. So, let's understand what it does.
Think of it like this: the neural network has no idea of the relationship between X and Y, so it makes a random guess, say y = x + 3. It then uses the data that it knows about, that's the set of Xs and Ys we've already seen, to measure how good or how bad its guess was. The loss function measures this and then gives the data to the optimizer, which figures out the next guess. So the optimizer thinks about how good or how badly the guess was done, using the data from the loss function. The logic is that each guess should be better than the one before. As the guesses get better and better and accuracy approaches 100 percent, we use the term convergence.
Here we have used mean squared error as the loss function and SGD, stochastic gradient descent, as the optimizer.
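To make this guess-measure-improve loop concrete, here is a toy, hand-written version with plain NumPy: gradient descent on the same y = 2x - 1 data. It is an illustrative stand-in for what the SGD optimizer does inside Keras, not the actual Keras implementation:

```python
import numpy as np

# The same six data points from the example
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0])

w, b = 0.0, 0.0   # the initial (bad) guess: y = 0x + 0
lr = 0.02         # learning rate

for _ in range(2000):
    guess = w * xs + b
    error = guess - ys                 # how bad the current guess is
    # gradients of the mean squared error loss, one small correction per step
    w -= lr * (2 * error * xs).mean()
    b -= lr * (2 * error).mean()

print(round(w, 2), round(b, 2))  # converges close to 2 and -1
```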
Our next step is to represent the known data: the Xs and the Ys you saw earlier. np.array uses a Python library called NumPy that makes data representation, particularly in lists, much easier. So here you can see we have one list for the Xs and another for the Ys. The training takes place in the fit command: here we're asking the model to figure out how to fit the X values to the Y values. epochs=500 means it will go through the training loop 500 times. The training loop is what we described earlier: make a guess, measure how good or bad the guess is with the loss function, then use the optimizer and the data to make another guess, and repeat. When the model has finished training, it will give you back values using the predict method.
Heres the code of what we talked about.
xs = np.array([-1.0, 0.0, 1.0, 2.0, 3.0, 4.0], dtype=float)
ys = np.array([-3.0, -1.0, 1.0, 3.0, 5.0, 7.0], dtype=float)
model.fit(xs, ys, epochs=500)
print(model.predict([10.0]))
So, what output do you expect? 19, right? But when you try this in the workbook yourself, you will see that it gives a value close to 19, not exactly 19. You might receive something like
[[18.987219]]
Why do you think this happens, given that the equation is y = 2x - 1?
There are two main reasons:
The first is that you trained it using very little data; there are only six points. Those six points are linear, but there's no guarantee that for every X the relationship will be Y = 2X - 1. There's a very high probability that Y equals 19 for X equals 10, but the neural network isn't positive, so it figures out a realistic value for Y. The second main reason is that neural networks, as they try to figure out the answers for everything, deal in probability. You'll see that a lot, and you'll have to adjust how you handle the answers to fit. Keep that in mind as you work through the code. Okay, enough talking. Now let's get hands-on, write the code we just saw, and run it.
If you have used Jupyter before you will find Colab easy to use. If you have not used Colab before consider watching this intro to it.
Now that you know how to use Colab, let's get started.
If you are using Jupyter Notebooks in your local environment download the code file here.
If you are using Colab follow the link here. You need to sign in with your Google account to run the code and receive a hosted run time.
I recommend you spend some time with the notebook: try changing the epochs, see the effect on loss and accuracy, and experiment with the code.
In this exercise you'll build a neural network that predicts the price of a house according to a simple formula. Imagine house pricing were as easy as: a house costs 50k plus 50k per bedroom, so a 1-bedroom house costs 100k, a 2-bedroom house costs 150k, and so on. How would you create a neural network that learns this relationship, so that it predicts a 7-bedroom house as costing close to 400k?
Hint: Your network might work better if you scale the house price down. You don't have to give the answer 400; it might be better to create something that predicts the number 4, and then your answer is in the hundreds of thousands.
Note: The maximum error allowed is 400 dollars
Make a new Colab Notebook and write your solution for this exercise and try it with some numbers the model has not seen.
Wonderful, you just created a neural net all by yourself, and some of you must be feeling like this (it's perfectly OK):
Source: me.me
Let's see my solution to this.
def house_model(y_new):
    xs = []
    ys = []
    for i in range(1, 10):
        xs.append(i)
        ys.append((1 + float(i)) * 50)
    xs = np.array(xs, dtype=float)
    ys = np.array(ys, dtype=float)
    model = keras.Sequential([keras.layers.Dense(units=1, input_shape=[1])])
    model.compile(optimizer='sgd', loss='mean_squared_error')
    model.fit(xs, ys, epochs=4500)
    return model.predict(y_new)[0] / 100
And this gives the answer within the required tolerance limit.
Here's the solution to this exercise here. If you want to try it out in Colab, go to the link and click the Open in Colab button.
That was it, the basics. You even created a neural net all by yourself and made pretty accurate predictions too. Next, we will look at some image classifiers with TensorFlow, so stay tuned and explore till then.
Learn more about Colab here
TensorFlow website here
TensorFlow docs here
GitHub repository here
A wonderful Neural Nets research paper here
A wonderful visualization tool by TensorFlow here
TensorFlow models and data sets here
Some TensorFlow research papers
Large scale Machine Learning with TensorFlow here
A Deep Level Understanding of Recurrent Neural Network & LSTM with Practical Implementation in Keras & Tensorflow here
Next Blog in this series here
Basic calculus
Python programming
A Jupyter Notebook is a powerful tool for interactively developing and presenting data science projects. Jupyter Notebooks integrate your code and its output into a single document containing text, mathematical equations, and the visualizations the code produces. To get started with Jupyter Notebooks you'll need to install Jupyter for Python. The easiest way to do this is via pip:
pip3 install jupyter
I always recommend using pip3 over pip2 these days, since Python 2 is no longer supported as of January 1, 2020.
NumPy is one of the most powerful Python libraries: an open source numerical library providing multidimensional array and matrix data structures. It can be used to perform a number of mathematical operations on arrays, such as trigonometric, statistical and algebraic routines. NumPy is a wrapper around a library implemented in C. Pandas objects (we will explore what they are below) rely heavily on NumPy objects; Pandas essentially extends NumPy.
Use pip to install NumPy package:
pip3 install numpy
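As a tiny, illustrative taste of those array routines, NumPy applies operations elementwise without explicit loops:

```python
import numpy as np

a = np.array([1, 2, 3])

print(a * 2)     # [2 4 6], elementwise multiplication, no loop needed
print(a.mean())  # 2.0, a built-in statistical routine
```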
Pandas has been one of the most popular and favorite data science tools in the Python programming language for data wrangling and analysis. Real-world data is unavoidably messy, and Pandas is seriously a game changer when it comes to cleaning, transforming, manipulating and analyzing data. In simple terms, Pandas helps clean up the mess.
pip3 install pandas
There are, of course, a huge range of data visualization libraries out there, but if you're wondering why you should use Seaborn, put simply, it brings some serious power to the table that other tools can't quite match. You could also use Matplotlib for the visualizations we will be doing.
pip3 install seaborn
As the name suggests, we will use the random module to get a random partition of the dataset. random is part of the Python standard library, so there is nothing to install.
Remember the gradient descent formula for linear regression, where mean squared error was used as the cost. We cannot use mean squared error here, so we replace it with some error function E:
Gradient descent: theta = theta - alpha * (dE/dtheta)
Conditions for E:
Convex, or as convex as possible
Should be a function of theta
Should be differentiable
So we use the (binary cross-) entropy: E = -mean(y * log(h) + (1 - y) * log(1 - h)), where h = sigmoid(theta . x)
So that was the derivation behind the logistic regression algorithm; now we implement it in code. We will use the breast cancer data set, available on the UCI Machine Learning repository, to implement our logistic regression algorithm.
data set description
import numpy as np
import pandas as pd
import random
import seaborn as sns
df=pd.read_csv("breastcancer.data.txt",na_values=['?'])
df.drop(["id"],axis=1,inplace=True)
df["label"].replace(2,0,inplace=True)
df["label"].replace(4,1,inplace=True)
df.dropna(inplace=True)
full_data=df.astype(float).values.tolist()
df.head()
sns.pairplot(df)
There is no specific need to do PCA, as we have only 9 features, but we get a simpler model by eliminating 3 of these features; for that code, visit the GitHub link.
full_data=np.matrix(full_data)
epoch=150000
alpha=0.001
x0=np.ones((full_data.shape[0],1))
data=np.concatenate((x0,full_data),axis=1)
print(data.shape)
theta=np.zeros((1,data.shape[1]-1))
print(theta.shape)
print(theta)
test_size=0.2
X_train=data[:-int(test_size*len(full_data)),:-1]
Y_train=data[:-int(test_size*len(full_data)),-1]
X_test=data[-int(test_size*len(full_data)):,:-1]
Y_test=data[-int(test_size*len(full_data)):,-1]
def sigmoid(Z):
    return 1/(1+np.exp(-Z))
def BCE(X,y,theta):
    pred=sigmoid(np.dot(X,theta.T))
    mcost=-np.array(y)*np.array(np.log(pred))-np.array((1-y))*np.array(np.log(1-pred))
    return mcost.mean()
def grad_descent(X,y,theta,alpha):
    h=sigmoid(X.dot(theta.T))
    loss=h-y
    dj=(loss.T).dot(X)
    theta = theta - (alpha/len(X))*dj
    return theta
cost=BCE(X_train,Y_train,theta)
print("cost before: ",cost)
theta=grad_descent(X_train,Y_train,theta,alpha)
cost=BCE(X_train,Y_train,theta)
print("cost after: ",cost)
def logistic_reg(epoch,X,y,theta,alpha):
    for ep in range(epoch):
        # update theta
        theta=grad_descent(X,y,theta,alpha)
        # calculate new loss
        if ((ep+1)%1000 == 0):
            loss=BCE(X,y,theta)
            print("Cost function ",loss)
    return theta
theta=logistic_reg(epoch,X_train,Y_train,theta,alpha)
print(BCE(X_train,Y_train,theta))
print(BCE(X_test,Y_test,theta))
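Besides printing the BCE loss, you may also want plain classification accuracy. A hedged sketch of one common way to get it: threshold the sigmoid output at 0.5. The arrays below are tiny made-up stand-ins for illustration, not the breast cancer data:

```python
import numpy as np

def sigmoid(Z):
    return 1 / (1 + np.exp(-Z))

def accuracy(X, y, theta):
    # predict class 1 whenever the sigmoid output reaches 0.5
    pred = (sigmoid(X.dot(theta.T)) >= 0.5).astype(float)
    return float((pred == y).mean())

# Made-up example: a bias column plus one feature, and a theta
# that says "positive feature means class 1"
X = np.array([[1.0, -2.0], [1.0, 3.0], [1.0, 1.0]])
theta = np.array([[0.0, 5.0]])
y = np.array([[0.0], [1.0], [1.0]])
print(accuracy(X, y, theta))  # 1.0
```

The same function would apply to the X_test, Y_test and theta computed above, provided Y_test is reshaped to a column to match the predictions.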
Now we are done with the code.
Multi class Neural Networks
Random Forest classifier Project link