A Technical Blog

Asynchronous and Distributed File Loading

Today we are going to look at loading large datasets in an asynchronous and distributed fashion. In a lot of circumstances it is best to work with such datasets in an entirely distributed fashion, but for this demonstration we will assume that is not possible, because you need to Channel it into some serial process. That doesn’t have to be the case, though. Anyway, we use this to further introduce Channels and RemoteChannels. I have blogged about Channels before; you may wish to skim that first. That article focused on a single producer and a single consumer. This post will focus on multiple producers and a single consumer (though you’ll probably be able to work out multiple consumers from there; it is pretty symmetrical).
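To make the multiple-producer, single-consumer shape concrete, here is a minimal sketch. It is not the code from the full post: it runs in a single process with `@async` tasks and a plain `Channel`, whereas a distributed version would use a `RemoteChannel` and worker processes. `load_records` and `load_all` are hypothetical names, and the "files" are faked.

```julia
# Hypothetical stand-in for reading one file: returns a few fake records.
load_records(file_id) = ["file$(file_id)-record$(i)" for i in 1:3]

function load_all(file_ids)
    records = Channel{String}(32)   # buffered channel shared by every producer

    # One producer task per file; each puts its records onto the shared channel.
    producers = map(file_ids) do id
        @async foreach(r -> put!(records, r), load_records(id))
    end

    # Close the channel once all producers have finished, so the consumer loop ends.
    # (In this sketch a failing producer would leave the channel open.)
    @async begin
        foreach(wait, producers)
        close(records)
    end

    # The single consumer: a plain serial loop over whatever the producers send.
    collected = String[]
    for r in records
        push!(collected, r)
    end
    return collected
end

collected = load_all(1:4)
```

The order of `collected` depends on how the producer tasks are scheduled, so the consumer should not rely on it.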

more ...

These are a few of my Favourite Things (that are coming with Julia 1.0)

If I were more musically talented I would be writing a song: ♫ Arguments that are destructured, and operator characters combine-ed; Loop binding changes and convert redefine-ed… ♫ No, none of that, please stop.

Technically speaking these are a few of my Favourite Things that are in Julia 0.7-alpha. But since 1.0 is going to be 0.7 with deprecations removed, we can look at it as a 1.0 list.

Many people are getting excited about big changes like Pkg3, named tuples, field access overloading, lazy broadcasting, or the parallel task runtime (which isn’t in 0.7 alpha, but I am hopeful for 1.0). I am excited about them too, but I think they’re going to get all the attention they need. (If not, then they each deserve a post of their own; I am not going to try and squeeze them into this one.) Here are some of the smaller changes I am excited about.

These are excerpts from the 0.7-alpha NEWS.md.

more ...

Optimizing your diet with JuMP

I’ve been wanting to do a JuMP blog post for a while. JuMP is a Julia mathematical programming library. It is, to an extent, a DSL for describing constrained optimisation problems.

A while ago, a friend who was looking to get “buff” came to me. What he wanted to do was maximise his protein intake while maintaining a generally healthy diet. He wanted to know what foods he should be eating, so that he could devise a diet.

If one thinks about this, this is actually a Linear Programming problem – constrained linear optimisation. The variables are how much of each food to eat, and the constraints are around making sure you have enough (but not too much) of all the essential vitamins and minerals.
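Written out, the problem has roughly the following shape. The symbols are my own shorthand for this sketch rather than anything from the full post: $x_f$ is the amount of food $f$ eaten, $p_f$ its protein per unit, $a_{nf}$ the amount of nutrient $n$ per unit of food $f$, and $l_n$, $u_n$ the lower and upper limits for nutrient $n$.

$$
\begin{aligned}
\max_{x} \quad & \sum_{f} p_f \, x_f \\
\text{subject to} \quad & l_n \le \sum_{f} a_{nf} \, x_f \le u_n \quad \text{for every nutrient } n, \\
& x_f \ge 0 \quad \text{for every food } f.
\end{aligned}
$$

Both the objective and the constraints are linear in $x$, which is what makes this a linear program.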

Note: this is a bit of fun. In absolutely no way do I recommend using the diets generated by the code I am about to show off. I am in no way qualified to be giving dietary or medical advice, etc. But this is a great way to play around with optimisation.

more ...

String Types in Julia

A shortish post about the various string types in Julia 0.6, and its packages. This post covers Base.String, Base.SubString, WeakRefStrings.jl, InternedStrings.jl, ShortStrings.jl and Strs.jl; and also mentions StringEncodings.jl. Thanks to Scott P Jones, who helped write the section on his Strs.jl package.

more ...

DataDeps.jl -- Repeatable Data Setup for Repeatable Science

This is just a quick post to show off DataDeps.jl. DataDeps.jl is the long-discussed BinDeps for data. At its heart it is a tool for reproducible data science. It means anyone trying to run your code later, in a different environment, isn’t faffing around trying to work out where to download the data from and how to connect it to your scripts.

more ...

7 Binary Classifier Libraries in Julia

I wished to do some machine learning for binary classification. Binary classification is perhaps the most basic of all supervised learning problems. Unsurprisingly, Julia has many libraries for it. Today we are looking at: LIBLINEAR (linear SVMs), LIBSVM (Kernel SVM), XGBoost (Extreme Gradient Boosting), DecisionTrees (RandomForests), Flux (neural networks), TensorFlow (also neural networks).

In this post we are only concentrating on their ability to be used for binary classification. Most (all) of these do other things as well. We’ll also not really be going into exploring all their options (e.g. different types of kernels).

Furthermore, I’m not rigorously tuning the hyperparameters, so this can’t be considered a fair test of performance. I’m also not performing preprocessing (e.g. many classifiers like it if you standardise your features to zero mean and unit variance). You can look at this post more as talking about what code for each package looks like, roughly how long it takes, and how well it does out of the box.

It’s more of a showcase of what packages exist.
For TensorFlow and Flux, you could also treat this as a bit of a demo of how to use them to define binary classifiers, since they don’t do it out of the box.

more ...

Thread Parallelism in Julia

Julia has 3 kinds of parallelism. The well-known, safe, slowish and easyish, distributed parallelism, via pmap, @spawn and remotecall. The wellish-known, very safe, very easy, not-actually-parallelism, asynchronous parallelism via @async. And the more obscure, less documented, experimental, really unsafe, shared memory parallelism via @threads. It is the last we are going to talk about today.

I’m not sure if I can actually teach someone how to write threaded code, let alone efficient threaded code. But this is me giving it a shot. The example here is going to be fairly complex. For a much simpler example of use, on a problem that is more easily parallelizable, see my recent Stack Overflow post on parallelizing sorting.

(Spoilers: in the end I don’t manage to extract any serious performance gains from parallelizing this prime search, unlike parallelizing that sorting. Parallelizing sorting worked out great.)

more ...

Lazy Sequences in Julia

I wanted to talk about using Coroutines for lazy sequences in Julia, because I am rewriting CorpusLoaders.jl to do so in a non-deprecated way.

This basically corresponds to C#’s yield return and Python’s yield statements. (Many other languages also have this, but I think they are the most well known.)

The goal of using lazy sequences is to be able to iterate through something without having to load it all into memory, since you are only going to be processing it a single element at a time. Or potentially, for some kind of moving average, or for acausal language modelling, a single window of elements at a time. The point is, at no point do I ever want to load all 20GB of Wikipedia into my program, nor all 100GB of Amazon product reviews.

And I especially do not want to load $\infty$ bytes of every prime number.
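As a taste of what such a lazy sequence can look like, here is a minimal sketch of an endless prime generator backed by a `Channel`. It is not the code from the full post; `primes_lazy` is a hypothetical name, the primality check is plain trial division, and I am assuming Julia 1.3 or later for the `Channel{Int}() do ... end` form.

```julia
# Lazily yield primes forever. The channel is unbuffered, so the producer
# blocks at each put! until the consumer asks for the next element.
function primes_lazy()
    Channel{Int}() do ch
        known = Int[]                 # primes found so far, used for trial division
        n = 2
        while true
            if all(p -> n % p != 0, known)
                push!(known, n)
                put!(ch, n)           # hand one prime to the consumer, then wait
            end
            n += 1
        end
    end
end

# Only as many primes as we ask for are ever computed.
first_ten = collect(Iterators.take(primes_lazy(), 10))
# first_ten == [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
```

The producer task simply stays blocked once the consumer stops asking, which is what keeps the infinite sequence from ever being materialised.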

more ...

Using julia -L startupfile.jl, rather than machinefiles for starting workers.

If one wants to have full control over the worker processes, the method to use is addprocs, together with the -L startupfile.jl command-line argument when you start Julia. See the documentation for addprocs.

The simplest way to add worker processes is to invoke Julia with julia -p 4. The -p 4 argument says to start 4 worker processes on the local machine. For more control, one uses julia --machinefile ~/machines, where ~/machines is a file listing the hosts. The machinefile is often just a list of hostnames/IP addresses, but sometimes is more detailed. Julia will connect to each host and start a number of workers on each equal to the number of cores.

Even the most detailed machinefile doesn’t give full control. For example, you cannot specify the topology, or the location of the Julia executable.

For full control, one should invoke addprocs directly, and to do so, one should use julia -L startupfile.jl
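A startup file along these lines might look like the sketch below. It is not the file from the full post, and it assumes a Julia version where addprocs lives in the Distributed standard library (0.7 and later). It starts two local workers so that it runs anywhere; the host name and executable path in the comment are placeholders.

```julia
# startupfile.jl -- load it with: julia -L startupfile.jl
using Distributed

# Two local workers, connected only to the master process.
# The topology is exactly the kind of setting a machinefile cannot express.
addprocs(2; topology=:master_worker)

# For remote hosts, the same call takes a list of machines and the location
# of the Julia executable (placeholder values, not from the original post):
# addprocs(["user@host"]; exename="/path/to/julia", topology=:master_worker)

println("Started workers: ", workers())
```

Because the file is ordinary Julia code, anything addprocs accepts can be decided here, including settings computed at run time.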

more ...

Intro to Machine Learning with TensorFlow.jl

In this blog post, I am going to go through a series of neural network structures. This is intended as a demonstration of the more basic neural net functionality. This blog post serves as an accompaniment to the introduction to machine learning chapter of the short book I am writing (currently under the working title “Neural Network Representations for Natural Language Processing”).

more ...

TensorFlow's SVD is significantly worse than LAPACK's, but still very good

TensorFlow’s SVD is significantly less accurate than LAPACK’s (i.e. the linear algebra library backing Julia and NumPy/SciPy). But it is still incredibly accurate, so probably don’t panic. If your matrices have very large ($>10^6$) values, then the accuracy difference might be relevant for you (but probably isn’t). However, neither LAPACK nor TensorFlow is great in that case – though LAPACK is still much better.

more ...

Plain Functions that Just Work with TensorFlow.jl

Anyone who has been stalking me may know that I have been making a fairly significant number of PRs against TensorFlow.jl. One thing I am particularly keen on is making the interface really Julian, taking advantage of the ability to overload Julia’s great syntax for matrix indexing and operations. I will make another post going into those enhancements sometime in the future, and into how great Julia’s ability to overload things is. Probably after #209 is merged. This post is not directly about those enhancements, but rather about an emergent feature I noticed today. I wrote some code to run in base Julia, but just by changing the types to Tensors it now runs inside TensorFlow, and on my GPU (potentially).

more ...

JuliaML and TensorFlow Tutorial

This is a demonstration of using JuliaML and TensorFlow to train an LSTM network. It is based on Aymeric Damien’s LSTM tutorial in Python. All the explanations are my own, but the code is generally similar in intent. There are also some differences in terms of network shape.

The task is to use LSTM to classify MNIST digits. That is image recognition. The normal way to solve such problems is a ConvNet. This is not a sensible use of LSTM; after all, it is not a time series task. The task is made into a time series task by having the images arrive one row at a time; the network is asked to output the class at the end, after seeing the 28th row. So the LSTM network must remember the 27 prior rows. This is a toy problem to demonstrate that it can.

more ...

JuliaPro beta 0.5.02 first impressions

JuliaPro is JuliaComputing’s prepackaged bundle of Julia, with the Juno/Atom IDE and a bunch of packages. The short of it is: there is no reason not to install Julia this way on a Mac/Windows desktop – it is more convenient and faster to set up, but it is nothing revolutionary.

more ...

Julia and OpenFST a glue story

Julia as a Glue Language

Julia is a great language for scientific and technical programming. It is more or less all I use in my research code these days. It gets a lot of attention for being great for scientific programming because of its great matrix syntax, high speed and optimisability, foreign function interfaces, range of scientific libraries, etc. It has all that, sure. (Though it is still in alpha, so many things are a bit broken at times.) One thing that is under-mentioned is how great it is as a “glue” language.

more ...

An Algebraic Structure For Path Schema (Take 2)

This is a second shot at expressing Path Schema as algebraic objects. See my first attempt. The definitions should be equivalent, and any places where they are not indicate a deficiency in one of the definitions. This should be a bit more elegant than before. It is also a bit more extensive. Note that and are now defined differently, and and are what one should be focussing on instead, this is to use the free monoid convention.

In general a path can be described as a hierarchical index onto a directed multigraph, noting that “flat” sets, trees, and directed graphs are all particular types of directed multigraphs.

To repeat the introduction:

This post comes from a longish discussion with Fengyang Wang (@TotalVerb), on the JuliaLang Gitter. It’s pretty cool stuff.

It is defined here independent of the object (filesystem, document, etc.) being indexed. The precise implementation of the algebraic structure differs depending on the Path types in question, e.g. Filesystem vs URL vs XPath.

This definition is generally applicable to paths, such as:

  • File paths
  • URLs
  • XPath
  • JSON paths
  • Apache ZooKeeper Paths
  • Swift Paths (Server/Container/Pseudofolder/Object)
  • Globs

The definition which follows provides all the expected functionality on paths.

more ...

An Algebraic Structure For Path Schema (Take 1)

Note: I have written a much improved version of this. See the new post.

This post comes from a longish discussion with Fengyang Wang (@TotalVerb), on the JuliaLang Gitter. It’s pretty cool stuff.

In general a path can be described as a hierarchical index. It is defined here independent of the object (filesystem, document, etc.) being indexed. The precise implementation of the algebraic structure differs depending on the Path types in question, e.g. Filesystem vs URL vs XPath.

This definition is generally applicable to paths, such as:

  • File paths
  • URLs
  • XPath
  • JSON paths
  • Apache ZooKeeper Paths
  • Swift Paths (Server/Container/Pseudofolder/Object)
  • Globs

The definition which follows provides all the expected functionality on paths.

more ...