I was recently messing around with the new TensorFlow.js library. Since JavaScript is the only language I work in, I was glad to hear it was becoming available. From my brief experimentation, I have found the API to be extremely easy to use, provided one has some basic machine learning concepts under one's belt.
I devised a simple experiment which I didn't particularly expect to be fruitful, but if I could get a functioning model, it would be a proof of concept for handling actual datasets. As I suspected, the results were bad for predicting new examples, but I still think my efforts were productive enough to be worth sharing, and I definitely learned some things along the way.
My initial idea was to create a classifier for music genres, one which, given a new example of a simple melody, would classify it as one of four genres: blues, pop, jazz, or metal. My approach was to model a melody as an eight-note sequence, where each note is represented by a number from 1 to 12 corresponding to a position in the musical scale, with 0 meaning no note was played on that beat. These are the initial melodies I came up with:
const melodies = {
  blues: [
    [1, 0, 1, 3, 0, 1, 1, 0],
    [1, 1, 3, 1, 3, 1, 3, 1],
    [5, 5, 6, 5, 7, 5, 6, 5],
    [1, 1, 0, 1, 0, 1, 0, 1],
  ],
  pop: [
    [1, 0, 1, 1, 0, 1, 0, 12],
    [1, 3, 1, 3, 1, 5, 5, 5],
    [1, 1, 1, 1, 1, 12, 12, 3],
    [6, 6, 5, 3, 0, 3, 1, 3],
  ],
  jazz: [
    [1, 5, 8, 1, 1, 0, 1, 0],
    [8, 7, 6, 5, 4, 3, 1, 5],
    [1, 4, 6, 7, 8, 7, 6, 4],
    [3, 10, 0, 0, 5, 3, 5, 10],
  ],
  metal: [
    [1, 1, 2, 1, 11, 4, 1, 2],
    [1, 4, 7, 10, 7, 4, 1, 4],
    [1, 1, 1, 11, 1, 2, 2, 2],
    [1, 4, 0, 4, 6, 6, 0, 9],
  ],
};
Then I converted each 8-note sequence into an 8×12 matrix: instead of a number from 1 to 12, each note played becomes a 1 at that note's position in a 12-element row, with all other positions 0 (a one-hot encoding). So the first blues melody would look like:
- [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- [0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
- [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
// function to convert a melody to its one-hot matrix format
function convertMelody(melody) {
  const converted = [];
  for (let i = 0; i < melody.length; i += 1) {
    const note = melody[i];
    const beat = new Array(12).fill(0);
    if (note) {
      beat[note - 1] = 1;
    }
    converted.push(beat);
  }
  return converted;
}
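As a quick sanity check (this snippet is my own), converting the first blues melody reproduces the matrix shown above:

console.log(convertMelody(melodies.blues[0]));
// → eight rows of twelve 0s and 1s, matching the matrix above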
Now for the machine learning part. The key building block of TensorFlow.js is the tensor. A tensor is like an array, but generalizable to any number of dimensions, and it provides an interface for abstract operations and transformations. In TensorFlow.js, tensors also take advantage of the GPU through WebGL, so the library provides cleanup operations to free GPU memory as it accumulates.
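To make the cleanup point concrete, here is a minimal sketch (the variable names are my own) using dispose() and tf.tidy(), both part of the TensorFlow.js API:

// a 2-D tensor from a nested array
const t = tf.tensor2d([[1, 2], [3, 4]]);
t.print();
t.dispose(); // explicitly free the tensor's GPU memory

// tf.tidy() disposes every intermediate tensor created in the callback,
// keeping only the returned one
const doubled = tf.tidy(() => tf.tensor2d([[1, 2], [3, 4]]).mul(2));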
So for each of my melodies, I create a tensor and collect the tensors into an array:
const convertedMelodies = [];

for (let i = 0; i < 16; i += 1) {
  const genre = Object.keys(melodies)[i % 4];
  const song = melodies[genre][Math.floor(i / 4)];

  // convert melody to 2-D matrix
  const convertedMelody = convertMelody(song);

  // convert matrix to 2-D tensor
  const tensor = tf.tensor2d(convertedMelody);
  convertedMelodies.push(tensor);
}
In my initial attempt, I created a dense, fully connected input layer with one "neuron" per element in the 2-D tensor (96 = 8 × 12). (Strictly speaking, a dense layer applied to an [8, 12] input operates along the last axis, so its output is actually 8 × 96 before flattening.) Then I flatten the result into a 1-D tensor using the flatten layer. The final layer has 4 neurons, one for each music genre. TensorFlow provides the sequential model, which just means a stack where each layer feeds into the next, one after the other, with no skipped layers. The code looks like this:
const firstLayer = tf.layers.dense({
  units: 96,
  inputShape: [8, 12],
  activation: 'relu', // Rectified Linear Units
});

const flatten = tf.layers.flatten();

const thirdLayer = tf.layers.dense({
  units: 4,
  activation: 'softmax',
});

const model = tf.sequential();

model.add(firstLayer);
model.add(flatten);
model.add(thirdLayer);
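At this point you can call model.summary(), which is part of the API, to confirm the shapes flowing through the stack:

// prints each layer's output shape and parameter count
model.summary();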
units is how many neurons the layer has, and inputShape describes the shape of the tensor to expect as input; it is only necessary on the first layer. I use a 'relu' activation function on the first layer, which is short for Rectified Linear Unit. You know a "rectified" activation function must be a good one. For positive inputs, ReLU means what comes in goes out: if there is .5 "charge" coming in, the neuron fires off .5 in output, while anything negative is clamped to zero. As an alternative example, sometimes neurons use some sort of binary threshold, where the neuron either fires given sufficient input or doesn't at all, which is closer to how actual neurons work.
The softmax creates a distribution in which all the values sum to 1, so each output value can be read as the probability that its neuron's genre is the correct one.
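For intuition, here is what those two functions look like hand-rolled in plain JavaScript (a sketch of the math, not the library's internals):

// ReLU: pass positive inputs through, clamp negatives to zero
const relu = (x) => Math.max(0, x);

// softmax: exponentiate, then normalize so the outputs sum to 1
function softmax(values) {
  const exps = values.map((v) => Math.exp(v));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

console.log(relu(0.5));             // 0.5
console.log(relu(-2));              // 0
console.log(softmax([1, 2, 3, 4])); // ~[0.032, 0.087, 0.237, 0.644]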
Now to train the model. If you followed my last post on gradient descent, you already know how this works. I choose a learning rate of .2 (which corresponds to the step size from the last article) and categorical cross-entropy as the loss function. That last part just means the target is a value of 1 for the correct category and 0 for all the others, and the loss measures how far the softmax distribution above is from that target.
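Before wiring up the optimizer, here is what categorical cross-entropy actually computes for a single example (my own sketch, not the library's implementation). With a one-hot target, it reduces to the negative log of the probability assigned to the true class:

// categorical cross-entropy for one one-hot example:
// -sum(target[i] * log(predicted[i]))
function categoricalCrossentropy(target, predicted) {
  return -target.reduce((loss, t, i) => loss + t * Math.log(predicted[i]), 0);
}

// a confident, correct prediction gives a small loss...
console.log(categoricalCrossentropy([0, 0, 0, 1], [0.05, 0.05, 0.1, 0.8]));   // ~0.22
// ...a near-uniform prediction gives a loss near log(4)
console.log(categoricalCrossentropy([0, 0, 0, 1], [0.25, 0.25, 0.25, 0.25])); // ~1.39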
const LEARNING_RATE = 0.2;

// stochastic gradient descent
const optimizer = tf.train.sgd(LEARNING_RATE);

model.compile({
  optimizer: optimizer,
  loss: 'categoricalCrossentropy',
  metrics: ['accuracy'], // include accuracy metrics in result
});
In this case I'm going to update the weights after every example. You can optimize by batching operations so as not to overload the GPU; here I'm doing one data point per batch (a batched variant is sketched after the training loop below).
async function train() {
  for (let i = 0; i < convertedMelodies.length; i += 1) {
    // the 1 at the front means this is a batch of size 1
    const batch = convertedMelodies[i].reshape([1, 8, 12]);

    // what is the correct category?
    const labelIndex = i % 4;
    let label = new Array(4).fill(0);
    label[labelIndex] = 1;
    label = tf.tensor1d(label).reshape([1, 4]);

    // train
    const hist = await model.fit(batch, label, {
      batchSize: 1,
      epochs: 1,
    });

    // print some stats
    const loss = hist.history.loss[0];
    const accuracy = hist.history.acc[0];
    console.log(loss, accuracy);
  }
}
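If you did want bigger batches, one way (a sketch, using tf.stack, which is part of the API) is to stack all sixteen melodies into a single [16, 8, 12] tensor and let model.fit handle the batching:

async function trainBatched() {
  // stack all sixteen [8, 12] tensors into one [16, 8, 12] tensor
  const inputs = tf.stack(convertedMelodies);

  // one-hot labels in the same order the melodies were pushed
  const labels = tf.tensor2d(
    convertedMelodies.map((_, i) => {
      const row = new Array(4).fill(0);
      row[i % 4] = 1;
      return row;
    })
  );

  const hist = await model.fit(inputs, labels, {
    batchSize: 4,
    epochs: 50,
    shuffle: true,
  });
  console.log(hist.history.loss);
}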
Then test it:
const test = [1, 3, 5, 1, 6, 5, 3, 1];
const convertedTest = convertMelody(test);
const tensorTest = tf.tensor2d(convertedTest);
model.predict(tensorTest.reshape([1, 8, 12])).print();
My result:
Tensor [[0.2256236, 0.236722, 0.2450776, 0.2925768],]
So the model gives its highest probability (about 0.29, only barely ahead of the other genres) to the above melody being heavy metal. You can try playing it on your instrument of choice to see if you agree with that conclusion.
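To turn that distribution into a label, you can take the argMax of the output (the variable names here are my own):

// pick the genre with the highest probability
const genres = Object.keys(melodies); // ['blues', 'pop', 'jazz', 'metal']
const prediction = model.predict(tensorTest.reshape([1, 8, 12]));
const winner = prediction.argMax(-1).dataSync()[0];
console.log(genres[winner]); // 'metal' for the output above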
Conclusion
With only 16 samples, it would be a surprise indeed if the model predicted new melodies accurately. Also, the examples I provided may not have been ideal exemplars for their respective genres. This is another case where bigger datasets are better.
I also thought about pooling the input into 12 neurons — one for each note. Creating an intermediate dense layer like this would have the effect of masking all of the information about the temporal position of the notes, and instead would be an aggregated statistic about which notes were played. A way around this could be to have both the initial layer and second layer connected to the final layer: that way both contribute to the result. But since I don’t know what I’m doing, I stayed with the simpler option.
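For the curious, TensorFlow.js does support that kind of branching through its functional API (tf.model instead of tf.sequential). Here is a rough sketch of the two-branch idea, not something I actually trained:

// two branches from the same input, concatenated before the output layer
const input = tf.input({ shape: [8, 12] });

// branch 1: the temporal path, as before
const temporal = tf.layers
  .flatten()
  .apply(tf.layers.dense({ units: 96, activation: 'relu' }).apply(input));

// branch 2: average over the 8 beats, one value per note,
// discarding temporal position
const pooled = tf.layers.globalAveragePooling1d().apply(input);

// both branches feed the final softmax
const merged = tf.layers.concatenate().apply([temporal, pooled]);
const output = tf.layers
  .dense({ units: 4, activation: 'softmax' })
  .apply(merged);

const branchedModel = tf.model({ inputs: input, outputs: output });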
I hope this gave some insight into how to use the API, and that it showed how simple this library is to use. There is a ton more; I haven't even scratched the surface here. I'm hoping to try again with actual data. If you have any ideas about future directions or projects of your own, be sure to let me know below, and I can provide rudimentary advice or steal your idea. Thanks!