DataDeps.jl -- Repeatable Data Setup for Repeatable Science

This is just a quick post to show off DataDeps.jl. DataDeps.jl is the long-discussed BinDeps for data. At its heart it is a tool for reproducible data science. It means anyone trying to run your code later, in a different environment, isn’t faffing around trying to work out where to download the data from and how to connect it to your scripts.

I am not going to go into too much detail here; it is all documented in the package README. This is more of a demo.

Like most of my blog posts, this is available as a Jupyter Notebook on my GitHub.

The key features of DataDeps.jl are:

Repeatable and idempotent data setup/downloading
Flexible storage locations that do not require changes to code
Checksum-validated downloads
Messages to users before downloading the first time
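
To make the checksum point concrete: validating a download boils down to hashing the file and comparing the result against the digest declared up front. Here is a rough sketch using Julia's SHA standard library. It is not DataDeps.jl's actual implementation, and file_sha256 and checksum_ok are names I made up for illustration.

```julia
using SHA

# Hex-encoded SHA-256 digest of the file at `path`.
file_sha256(path) = bytes2hex(open(sha256, path))

# Hypothetical helper: true when the file matches the declared digest.
checksum_ok(path, expected_hex) = file_sha256(path) == lowercase(expected_hex)

# Write a small stand-in "download" to a temporary file.
path, io = mktemp()
write(io, "some,data\n1,2\n")
close(io)

# The digest a registration block would declare for that content.
declared = bytes2hex(sha256("some,data\n1,2\n"))

println(checksum_ok(path, declared))  # true
println(checksum_ok(path, "0"^64))    # false
```

If the digest does not match, the sensible thing is to refuse to use the file, which is why a declared hash protects you from silently changed or corrupted data.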

Enough promo, on with the examples:

Example 1: Word Embeddings, data for your model

Your system might need word embeddings. They are pretty important for a lot of NLP research. If you want to use pretrained ones, they can be pretty big though. Too big for adding to your repository. They are definitely data that your model depends on.

Embeddings.jl exposes a large number of pretrained embeddings, basically using the process demonstrated below. It is just a front-end over a few hundred datadeps.

Input:

using DataDeps, Plots

First we are going to register a DataDep. In a package this would go in your module’s __init__ function. We are going to declare a data dependency for some word embeddings.

Input:

register(DataDep("FastText en",
    """
    Dataset: FastText Word Embeddings for English.
    Author: Bojanowski et al. (Facebook)
    License: CC BY-SA 3.0
    Website: https://github.com/facebookresearch/fastText/blob/master/pretrained-vectors.md
    
    300 dimensional FastText word embeddings, trained on Wikipedia
    Citation: P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information

    Notice: this file is over 6.2GB
    """,
    "https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec",
    "ba5420ac217fb34f15f58ded0d911a4370dfb1f3341fa7511a49ae74c87de282"
));
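
In a package, the same registration would live inside the module's __init__ function, as mentioned above. A minimal sketch, assuming a hypothetical package called MyEmbeddings and a shortened message; the URL and checksum are the ones from the registration above. Registering does not download anything by itself.

```julia
module MyEmbeddings  # hypothetical package name, for illustration only

using DataDeps

# __init__ runs when the module is loaded, so the datadep is registered
# before any datadep"FastText en" string is resolved.
function __init__()
    register(DataDep(
        "FastText en",
        """
        Dataset: FastText Word Embeddings for English.
        Notice: this file is over 6.2GB
        """,
        "https://s3-us-west-1.amazonaws.com/fasttext-vectors/wiki.en.vec",
        "ba5420ac217fb34f15f58ded0d911a4370dfb1f3341fa7511a49ae74c87de282"
    ))
end

end # module
```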

Let’s see what we’ve got. Rather than needing to refer to your data by a path on disk, DataDeps.jl allows you to refer to it by name with a datadep string macro. This resolves into a path to it on disk, even if that means downloading it first. (Because I’ve run this code before, the data was already downloaded, so no download occurs this time.)

Input:

readdir(datadep"FastText en")

Output:

1-element Array{String,1}:
 "wiki.en.vec"

Now we are going to define a function to load up those word embeddings. DataDeps.jl doesn’t handle loading data – just downloading data. To load the data would require understanding a lot about its format. That is left to the user, or to other packages like MLDatasets.jl that know what the data they are consuming is.

Notice here the use of filepath=datadep"FastText en/wiki.en.vec" as an optional argument. This is a common pattern that I recommend using with DataDeps.jl. It means that if the user provides a path, the datadep string is never evaluated, which in turn means the data download will not be triggered (though in this case it has already happened).

Input:

function word_embeddings(words, filepath=datadep"FastText en/wiki.en.vec")
    embs = Dict{String, Vector{Float32}}()
    
    load_words=collect(words).*" " # add a space so that we can use startswith to check for complete matches
    for line in Iterators.drop(eachline(filepath), 1) # skip header on the first line
        if any(startswith.(line, load_words))           
            toks = split(line)
            word = first(toks)
            embs[word] = parse.(Float32, toks[2:end]) # parse each coefficient as a Float32
        end
    end
    embs
end

Output:

word_embeddings (generic function with 2 methods)
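
The optional-argument pattern used in word_embeddings relies on Julia evaluating a default value only when the caller leaves that argument out. Here is a small sketch of that behaviour in isolation. expensive_default_path is a made-up stand-in for the datadep string, and embeddings_path is a hypothetical function; neither is part of DataDeps.jl.

```julia
# Counts how many times the "expensive" default was evaluated.
lookup_count = Ref(0)

# Stand-in for datadep"FastText en/wiki.en.vec": pretend this triggers a download.
function expensive_default_path()
    lookup_count[] += 1
    return joinpath(tempdir(), "wiki.en.vec")
end

# The default expression runs only when `filepath` is not supplied.
embeddings_path(filepath = expensive_default_path()) = filepath

embeddings_path("/data/my_own_copy.vec")
println(lookup_count[])  # 0, the default was never evaluated

embeddings_path()
println(lookup_count[])  # 1, the default was evaluated once
```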

Input:

# Note: these categories are not mutually exclusive
# orange is a food and a color etc
# I've broken categories arbitrarily

foods = split("turkey chicken duck apple banana cheese sausage milk egg")
sports = split("cricket golf baseball football soccer rugby run walk swim dive")
colors = split("orange yellow blue green red")
tools = split("tape glue nails hammer saw drill")
objects = split("phone car truck record shed house castle rook")
other = split("down up danger risk reward  new old fresh stale glass stone china wood face");

Input:

# A bit of a metaprogramming hack
category_lookup = Dict{String,Symbol}()
for cat in [:foods, :sports, :colors, :tools, :objects, :other]
    var = eval(cat)
    for word in var
        category_lookup[word] = cat
    end
end;

Input:

embs_dict = word_embeddings(keys(category_lookup))

Output:

Dict{String,Array{Float32,1}} with 52 entries:
  "tape"     => Float32[0.31498, -0.041574, -0.096835, -0.087724, -0.078622, -0…
  "egg"      => Float32[0.23074, 0.014205, -0.3986, 0.057022, 0.032088, 0.51731…
  "risk"     => Float32[-0.22041, 0.043005, -0.16092, 0.42121, -0.31625, -0.129…
  "banana"   => Float32[-0.30111, -0.19338, 0.035946, 0.040627, 0.24098, -0.356…
  "china"    => Float32[0.065689, 0.22287, -0.02309, 0.22571, -0.40829, 0.20209…
  "walk"     => Float32[0.053497, 0.17538, -0.12849, 0.068115, -0.34802, -0.206…
  "nails"    => Float32[0.38722, -0.087961, -0.33036, 0.25719, -0.10132, 0.3656…
  "rook"     => Float32[0.16451, 0.044197, -0.31782, 0.04001, -0.1339, 0.26903,…
  "down"     => Float32[-0.17515, 0.021885, -0.25901, 0.20048, -0.19916, -0.056…
  "glue"     => Float32[0.22836, 0.14853, -0.36956, 0.27853, -0.40004, 0.12266,…
  "face"     => Float32[-0.14819, 0.16016, -0.31916, 0.28058, -0.34405, 0.01762…
  "old"      => Float32[-0.063426, -0.021367, 0.056441, 0.1353, 0.12058, 0.2589…
  "turkey"   => Float32[-0.51473, 0.02193, -0.24228, 0.22971, -0.61302, -0.1356…
  "cheese"   => Float32[0.20742, 0.04882, 0.078373, -0.24411, -0.24788, 0.35715…
  "truck"    => Float32[-0.049757, -0.24887, 0.077345, 0.38045, -0.50517, -0.20…
  "cricket"  => Float32[-0.15663, 0.18529, -0.074915, 0.80175, 0.37125, 0.01455…
  "shed"     => Float32[0.076611, -0.17981, -0.26234, 0.57905, -0.25095, -0.026…
  "reward"   => Float32[-0.05137, -0.096855, -0.13516, 0.029344, -0.13654, -0.2…
  "run"      => Float32[0.19979, 0.20623, -0.22006, 0.084749, -0.26972, -0.0459…
  "baseball" => Float32[0.044589, -0.089292, 0.18082, 0.54954, -0.25423, -0.247…
  "saw"      => Float32[-0.027857, -0.016778, -0.28143, 0.42337, 0.14235, 0.063…
  "danger"   => Float32[-0.15803, -0.18926, -0.35727, 0.25279, -0.4704, -0.0768…
  "red"      => Float32[-0.1397, -0.19608, 0.44096, 0.084868, 0.28052, -0.16625…
  "swim"     => Float32[0.17168, 0.18579, -0.60043, 0.36278, -0.24944, -0.26992…
  "green"    => Float32[-0.27572, -0.099347, 0.30856, 0.24058, 0.1654, 0.031648…
  ⋮          => ⋮

Let’s visualise them. I had to do a bit of hacking around with Plots.jl to get the visualisation I wanted: colour according to category, text according to index.

Input:

index = collect(keys(embs_dict))
embs = hcat(values(embs_dict)...)
categories = [category_lookup[word] for word in index] # look up each word's category
using Plots
using TSne # Note this package is not registered, you'll have to clone it
embs_dr = tsne(Float64.(embs)', 2, 0, 1000, 10.0)' # TSne.jl is still sideways, still only works with Float64s


## Plot it
function groupup(data::T, groupby=data) where {T}
    group_ind = Dict(reverse.(collect(enumerate(unique(groupby)))))
    ret = [eltype(T)[] for _ in 1:length(group_ind)]
    for (datum,group) in zip(data,groupby)
        push!(ret[group_ind[group]], datum)
    end
    ret
end
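
To make the groupup helper a bit more concrete, here it is on a toy input (the numbers and labels are made up). It splits data into one vector per distinct label, in order of first appearance of each label.

```julia
# Same helper as above, repeated so this snippet runs on its own.
function groupup(data::T, groupby=data) where {T}
    # Map each distinct group label to its position in order of first appearance.
    group_ind = Dict(reverse.(collect(enumerate(unique(groupby)))))
    ret = [eltype(T)[] for _ in 1:length(group_ind)]
    for (datum, group) in zip(data, groupby)
        push!(ret[group_ind[group]], datum)
    end
    ret
end

xs = [1.0, 2.0, 3.0, 4.0]                   # toy data
labels = [:foods, :tools, :foods, :tools]   # toy categories

groups = groupup(xs, labels)
println(groups)  # [[1.0, 3.0], [2.0, 4.0]]
```

Grouping the x values, y values and annotations with the same labels keeps them aligned, which is what lets each category be plotted as its own series.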

Input:

xss = groupup(embs_dr[1,:],categories)
yss = groupup(embs_dr[2,:],categories)
textss = groupup(index, categories)
plot(); # clear the plot
for (xs, ys, texts) in zip(xss, yss,textss)
    scatter!(xs, ys, series_annotations= texts, alpha=0.4,
    size=(800,600)
    )
end
plot!(legend=false)

Output:

[Figure: t-SNE scatter plot of the word embeddings, coloured by category and annotated with each word]

That worked pretty well. I think tweaking the perplexity of t-SNE a bit more could get a better result. Of course, as with all dimensionality reduction, some information is going to be lost and not expressed in the final form. Still, I think it has done well; the FastText embeddings are pretty good. Notice that it has located turkey and china together, presumably because their embeddings reflect that they are countries. FastText actually doesn’t capture countries during training as far as I can tell; I believe they attempt to remove all proper nouns during preprocessing (e.g. England is not in there), but I guess China and Turkey slip through as they are also regular nouns. Notice also that the ball sports are located together, separately from movement types like swim, walk and run. New is near fresh and old is near stale.

Example 2: WordNet.jl: Data for your package

I love WordNet.jl. WordNet is a pretty fundamental tool for NLP research (though it is getting a bit dated). WordNet.jl is the Julia binding. Understandably, @jbn doesn’t want to include the WordNet database in the repository, because of concerns about the file size and about redistributing someone else’s work. However, the package is fully dependent on having that data, so not installing it automatically makes it hard to build anything on top of it. Now this could be done with BinDeps, for example, or just by sticking a download into /deps/build.jl, but that isn’t great for this. There is no chance to display a message about the data’s real owner, and the location of the data wouldn’t be flexible: a path would need to be hardcoded in.

DataDeps.jl is expressly designed for these concerns.
(Update 2018: WordNet.jl now uses DataDeps.jl to do exactly this.)
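
On the flexible-location point, here is a rough sketch of how a user could redirect where data lives without touching any package code. It assumes DataDeps.jl consults the DATADEPS_LOAD_PATH environment variable (my reading of the README; check it for the exact semantics), and the directory name is hypothetical. To be safe, set it before loading the package.

```julia
# Assumption: DataDeps.jl reads DATADEPS_LOAD_PATH when deciding where to
# look for (and save) data. The directory below is hypothetical.
shared_data_dir = joinpath(homedir(), "shared_datasets")
ENV["DATADEPS_LOAD_PATH"] = shared_data_dir

# Code that refers to the data by name stays exactly the same, e.g.
#   DB(datadep"WordNet 3.0")
println(ENV["DATADEPS_LOAD_PATH"])
```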

What do we have to do to get WordNet.jl working without any manual data configuration by the user?

Input:

using WordNet, DataDeps

Input:

register(DataDep("WordNet 3.0",
    """
    Dataset: WordNet 3.0
    Website: https://wordnet.princeton.edu/wordnet

    George A. Miller (1995). WordNet: A Lexical Database for English.
    Communications of the ACM Vol. 38, No. 11: 39-41.

    Christiane Fellbaum (1998, ed.) WordNet: An Electronic Lexical Database. Cambridge, MA: MIT Press.

    License: 
    WordNet Release 3.0 This software and database is being provided to you, the LICENSEE, by Princeton University under the following license. By obtaining, using and/or copying this software and database, you agree that you have read, understood, and will comply with these terms and conditions.: Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty is hereby granted, provided that you agree to comply with the following copyright notice and statements, including the disclaimer, and that the same appear on ALL copies of the software, database and documentation, including modifications that you make for internal use or for distribution. WordNet 3.0 Copyright 2006 by Princeton University. All rights reserved. THIS SOFTWARE AND DATABASE IS PROVIDED "AS IS" AND PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR IMPLIED. BY WAY OF EXAMPLE, BUT NOT LIMITATION, PRINCETON UNIVERSITY MAKES NO REPRESENTATIONS OR WARRANTIES OF MERCHANT- ABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE LICENSED SOFTWARE, DATABASE OR DOCUMENTATION WILL NOT INFRINGE ANY THIRD PARTY PATENTS, COPYRIGHTS, TRADEMARKS OR OTHER RIGHTS. The name of Princeton University or Princeton may not be used in advertising or publicity pertaining to distribution of the software and/or database. Title to copyright in this software, database and any associated documentation shall at all times remain with Princeton University and LICENSEE agrees to preserve same.
    """,
    "http://wordnetcode.princeton.edu/3.0/WNdb-3.0.tar.gz",
    "658b1ba191f5f98c2e9bae3e25c186013158f30ef779f191d2a44e5d25046dc8";
    post_fetch_method = unpack
));

Output:

WARNING: Over-writing registration of the datadep: WordNet 3.0

Input:

WordNet.DB() = DB(datadep"WordNet 3.0")

That is it: the declaration of the datadep via the registration block (mostly just copy-pasted from the WordNet website),
plus the addition of a method to the DB constructor, and we are done.

Input:

db = DB()
lemma = db['n', "turkey"]

Output:

turkey.n

Input:

ss = synsets(db, lemma)

Output:

5-element Array{WordNet.Synset,1}:
 (n) Meleagris gallopavo, turkey (large gallinaceous bird with fan-shaped tail; widely domesticated for food)                                                                                            
 (n) Republic of Turkey, Turkey (a Eurasian republic in Asia Minor and the Balkans; on the collapse of the Ottoman Empire in 1918, the Young Turks, led by Kemal Ataturk, established a republic in 1923)
 (n) joker, turkey (a person who does something thoughtless or annoying; "some joker is blocking the driveway")                                                                                          
 (n) turkey (flesh of large domesticated fowl usually roasted)                                                                                                                                           
 (n) bomb, dud, turkey (an event that fails badly or is totally ineffectual; "the first experiment was a real turkey"; "the meeting was a dud as far as new business was concerned")                     

Input:

expanded_hypernyms(db, ss[2])

Output:

8-element Array{WordNet.Synset,1}:
 (n) land, country, state (the territory occupied by a nation; "he returned to the land of his birth"; "he visited several European countries")  
 (n) administrative division, administrative district, territorial division (a district defined for administrative purposes)                     
 (n) district, dominion, territory, territorial dominion (a region marked off for administrative or other purposes)                              
 (n) region (a large indefinite location on the surface of the Earth; "penguins inhabit the polar regions")                                      
 (n) location (a point or extent in space)                                                                                                       
 (n) physical object, object (a tangible and visible entity; an entity that can cast a shadow; "it was full of rackets, balls and other objects")
 (n) physical entity (an entity that has physical existence)                                                                                     
 (n) entity (that which is perceived or known or inferred to have its own distinct existence (living or nonliving))                              

Example 3: 538: Avengers Comic Book Characters: DataDepsGenerators.jl

So this last example is a chance to show off DataDepsGenerators.jl. It does the kinda fragile web scraping needed to generate code for registration blocks, which you can then edit and include in your project that uses DataDeps.jl.

We are going to load up 538’s dataset on Marvel comic book characters.

Input:

using DataDeps, DataDepsGenerators

Input:

generate(GitHub(), "https://github.com/fivethirtyeight/data/tree/master/avengers") |> print

Output:

register(DataDep(
    "Avengers",
    """
    	Dataset: Avengers
	Website: https://github.com/fivethirtyeight/data/tree/master/avengers
	# Avengers
	
	This folder contains the data behind the story [Joining The Avengers Is As Deadly As Jumping Off A Four-Story Building](http://fivethirtyeight.com/features/avengers-death-comics-age-of-ultron).
	
	`avengers.csv` details the deaths of Marvel comic book characters between the time they joined the Avengers and April 30, 2015, the week before Secret Wars #1.
	
	Header | Definition
	---|---------
	`URL`| The URL of the comic character on the Marvel Wikia
	`Name/Alias` | The full name or alias of the character
	`Appearances` | The number of comic books that character appeared in as of April 30
	`Current?` | Is the member currently active on an avengers affiliated team?...
	(Read more at https://rawgit.com/fivethirtyeight/data/master/avengers/README.md)
	
	LICENSE
	--------
	Attribution 4.0 International
	
	=======================================================================
	...
	(Read more at https://rawgit.com/fivethirtyeight/data/master/LICENSE)
    """,
    Any["https://cdn.rawgit.com/fivethirtyeight/data/11b50e9b1a2e12d5f366f1b5b4c048a71dc29544/avengers/README.md", "https://cdn.rawgit.com/fivethirtyeight/data/11b50e9b1a2e12d5f366f1b5b4c048a71dc29544/avengers/avengers.csv"],
    
))

Now DataDepsGenerators.jl isn’t perfect; you do have to check its output by hand, and probably edit it a bit. For example, because of how 538 laid out their GitHub repo (see issue fivethirtyeight/data/#101), DataDepsGenerators thinks this data is MIT licensed. It is actually under the Creative Commons Attribution 4.0 International License. We’re in compliance with that notice anyway, as it includes (I believe, but IANAL) all the attribution information we need.

Note also that it has failed to give it a good datadep name.

You shouldn’t do this in your packages, but for a demo like this, we can register that generated datadep immediately. We’ll pass in the name to the generator this time too.

Input:

eval(parse(generate(GitHub(), "https://github.com/fivethirtyeight/data/tree/master/avengers", "538 Avengers")));
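
A note for readers on newer Julia versions: from Julia 0.7 onwards, turning a string into an expression is done with Meta.parse rather than parse. Here is a tiny sketch of the same evaluate-a-generated-string trick, using a made-up string in place of the real generator output.

```julia
# Stand-in for the code string that generate(...) returns (made up for this sketch).
generated_code = "demo_name = \"538 Avengers\""

# Parse the string into an expression, then evaluate it at top level.
eval(Meta.parse(generated_code))

println(demo_name)  # 538 Avengers
```

As with the real thing, evaluating generated code blindly is only reasonable in a throwaway demo; in a package you would paste the generated block in and review it.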

Time to load it up, and then we will do some visualisations.

Input:

using FileIO, CSVFiles, DataFrames, Plots
characters = DataFrame(load(datadep"538 Avengers/avengers.csv"; escapechar='"'));

Let’s see how frequently the characters appear:

Input:

histogram(characters[:Appearances], legend=false)

Output:

[Figure: histogram of the number of appearances per character]

Looks kinda Zipfian. Not surprising. So who are the heavy hitters?
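
Before answering that, a quick aside: one rough way to put a number on "kinda Zipfian" is to rank the counts and fit a line to log(count) against log(rank). A slope near -1 is the textbook Zipf shape. The sketch below uses synthetic counts, not the Avengers data, purely to show the calculation.

```julia
using Statistics

# Synthetic, perfectly Zipfian counts (NOT the Avengers data): count ∝ 1/rank.
counts = [1000 / r for r in 1:50]

ranked = sort(counts, rev=true)
logrank = log.(1:length(ranked))
logcount = log.(ranked)

# Least-squares slope of log(count) against log(rank).
slope = cov(logrank, logcount) / var(logrank)
println(round(slope, digits=3))  # -1.0
```

On real data you would pass in the appearance counts instead and expect something only roughly in that neighbourhood.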

Input:

sort!(characters, [:Appearances], rev=true);
println.(characters[Symbol("Name/Alias")][1:5]);

Output:

Peter Benjamin Parker
Steven Rogers
James ""Logan"" Howlett
Anthony Edward ""Tony"" Stark
Thor Odinson

That’s Spider-Man, Captain America, Wolverine, Iron Man and Thor. Cool cool. So that is a bunch of dudes. How does the distribution of appearances look if you separate it out by gender?

Input:

histogram(characters[:Appearances], group=characters[:Gender], nbins=20, layout=2, color=["GREEN" "ORANGE"])

Output:

[Figure: histograms of appearances, one panel per gender]

Input:

dudes, ladies = groupby(characters, :Gender); # the MALE group comes back first here
println("Number of characters (Dudes, Ladies):\t\t", nrow.([dudes, ladies]))
println("Total appearances (Dudes, Ladies):\t\t", sum.([dudes[:Appearances], ladies[:Appearances]]))
println("Median appearances (Dudes, Ladies):\t\t", median.([dudes[:Appearances], ladies[:Appearances]]))

Output:

Number of characters (Dudes, Ladies):		[115, 58]
Total appearances (Dudes, Ladies):		[56358, 15273]
Median appearances (Dudes, Ladies):		[156.0, 101.0]

Ok, well that tells a story. Note that the scale (vertical and horizontal) of the ladies plot is less than half that of the dudes plot.

Let’s see when characters were introduced. This is the year the characters joined the Avengers, not the year they were first published (unfortunately). I suspect they correlate to some degree though.

Input:

sort!(characters,:Year)
scatter(characters[:Year],(173:-1:1); size=(800,3000), xlim=(1875,2040),legend=false,
    text=characters[Symbol("Name/Alias")].*" (" .* string.(characters[:Year]).*")",
    markercolor = ifelse.(characters[:Gender].=="MALE", "ORANGE", "GREEN")
)

Output:

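[Figure: scatter plot of characters by the year they joined the Avengers, labelled with name and year, coloured by gender]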

Anyway, that is enough about comic books.
DataDeps.jl isn’t about processing data, or the stuff I can do with it.
It is about setting up data so you can do stuff with it.

Conclusion

DataDeps.jl: get on it.
Sort out your data dependencies.
Make your scientific code easier for other people to run and reproduce by having it automatically download its data.
Make your packages easier to install by removing manual steps.
Spend less time worrying about setting up and managing your data, and more time analysing it and advancing science.