A look at AI datasets and some new work that's been released by the Stanford NLP group.

In case you are unaware, a data treasure trove is available through torrents: 27.23TB of research training data, including gems like:

  • Liver Tumor Segmentation
  • Electron Microscopy, Hippocampus
  • Digital Surface & Digital Terrain Models

and many more. Just check out academictorrents.com.

~ For those relatively new to ML/DL who have an interest in NLP: you should check out the Stanford CoreNLP library. It's a suite of core NLP tools written in Java that takes raw human-language text as input and returns, among other things, the base forms of words. If you are interested in the code, check it out via GitHub here. There is also an introductory article by Niklas Donges on Towards Data Science.

Other interesting links

There is a fascinating article that looks into stylometry and the ability of ML to identify a programmer using only their compiled binary code. By decompiling the binary back into C++ and then running a model across it, you can pick out fingerprints unique to individual programmers. Link.

~ New work out of DeepMind introduces the Neural Arithmetic Logic Unit (paper), a module that can be trained to track time, perform arithmetic over images of numbers, and even translate numerical language into real-valued scalars. This has been a significant issue for neural networks in the past because they do not exhibit systematic generalization; they tend to learn counting through memorization rather than by abstraction (as humans do).
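
To make the idea a little more concrete, here is a minimal sketch of a single NALU forward pass in plain Python, following the formulation as I understand it from the paper: an additive path a = Wx, a multiplicative path m = exp(W log(|x| + eps)), and a learned gate g = sigmoid(Gx) that blends the two. The function names (nac_weights, nalu_forward) and the hand-picked weights are my own assumptions for illustration; nothing here is trained, and this is not the authors' code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def matvec(W, x):
    # Plain matrix-vector product on nested lists.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def nac_weights(W_hat, M_hat):
    # W = tanh(W_hat) * sigmoid(M_hat), elementwise. This pushes each
    # weight towards -1, 0 or 1, i.e. "subtract", "ignore" or "add".
    return [
        [math.tanh(w) * sigmoid(m) for w, m in zip(w_row, m_row)]
        for w_row, m_row in zip(W_hat, M_hat)
    ]


def nalu_forward(x, W_hat, M_hat, G, eps=1e-7):
    # Hypothetical helper: one NALU layer applied to one input vector.
    W = nac_weights(W_hat, M_hat)
    a = matvec(W, x)                                   # additive path
    log_x = [math.log(abs(xi) + eps) for xi in x]
    m = [math.exp(v) for v in matvec(W, log_x)]        # multiplicative path
    g = [sigmoid(v) for v in matvec(G, x)]             # gate between paths
    return [gi * ai + (1.0 - gi) * mi for gi, ai, mi in zip(g, a, m)]


# Hand-picked (not learned) parameters that saturate W to roughly [1, 1].
x = [3.0, 4.0]
saturated = [[20.0, 20.0]]
added = nalu_forward(x, saturated, saturated, G=[[20.0, 20.0]])         # gate near 1
multiplied = nalu_forward(x, saturated, saturated, G=[[-20.0, -20.0]])  # gate near 0
print(added, multiplied)  # close to 3 + 4 and 3 * 4 respectively
```

With the gate pushed open the unit behaves like addition, and with it pushed shut the same weights act as multiplication in log space. Because neither path depends on the magnitude of the inputs, the same weights keep working on numbers far outside whatever range was seen in training, which is the kind of extrapolation the paper is after.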

~ A Wall Street Journal article making the rounds right now discusses IBM Watson Health's issues as they relate to its AI efforts. Essentially, the article details the challenges Watson has faced on oncology-related projects and how it has had a limited impact on patients, mostly because Watson was not accurate enough. The article points out, again, that a lack of scalable training data for rare or recurring cancers can increase false positives, and that treatments are evolving faster than data can be entered and feature-engineered. An interesting read on the challenges of real-world ML/DL systems in healthcare (and the increasing need for better generalization). Link