Last week I attended the Latin American Conference in Informatics. Three to five sessions were running at the same time. It was an overwhelming schedule that served more as a restaurant menu, with papers ranging from the history of computer science to biological computing. However, the best part of the event was listening to John E. Hopcroft in his tutorial “Mathematics for the Information Era” and his talk “New directions in computer science”.

Both of them dealt with the constantly increasing and, for a long time now, intractable amount of information available to us. We generate an amount of information equivalent to a couple of hundred newspapers per person per day. Of course that includes junk and data generated by computers themselves. Still, information should be useful to us, unless we are just trying to create the biggest museum in the world (called the Internet), one that receives visitors who only come to take pictures.

But there is nothing new in what I just said. What is new is the summary that Hopcroft managed to put together. He presented a set of the “new” mathematical tools that we have to attack the problem of Big Information. Here is the list that I can extract from my notebook, with a very brief explanation of each item (forgive me if I am not as precise as he was or if I misunderstood any of the concepts):

High dimensionality: things in many dimensions don’t behave like things in 1, 2, 3 or even 4 dimensions. A new set of theories is being developed in this area, increasing our understanding of how complex problems with many dimensions (such as texts) may work. Some of the next topics are related to high dimensionality. This is more like a whole new area of study.

Volumes: in many dimensions volumes do not behave as we expect. For example, as the number of dimensions tends to infinity, the volume of the unit sphere (or hypersphere) goes to zero. For the Gaussian things are even more unexpected: nearly all of its mass sits in a thin shell away from the center, so it looks like a ring. The consequences are totally nuts. Let’s say you have two Gaussians that generated points in a high-dimensional space. Then you can deduce which of the Gaussians generated each point (with extremely good accuracy, as long as their centers are far enough apart).
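
Both claims are easy to check numerically. The following is a minimal sketch of my own, not something shown in the tutorial: it uses the standard formula for the volume of the unit ball, pi^(d/2) / Gamma(d/2 + 1), and a small simulation (the function names, sample count and seed are my own choices) showing that Gaussian points end up at a distance of about sqrt(d) from the center.

```python
import math
import random

def unit_ball_volume(d):
    """Volume of the unit ball in d dimensions: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

def mean_scaled_norm(d, samples=200, seed=0):
    """Average of |x| / sqrt(d) over standard Gaussian points x in d dimensions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(samples):
        squared = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        total += math.sqrt(squared) / math.sqrt(d)
    return total / samples

# The volume of the unit ball shrinks toward zero as d grows.
for d in (2, 3, 5, 10, 20, 50, 100):
    print(f"d={d:3d}  unit ball volume = {unit_ball_volume(d):.3e}")

# If the mass sits in a thin shell of radius about sqrt(d), this stays close to 1.
for d in (10, 100, 1000):
    print(f"d={d:4d}  mean |x|/sqrt(d) = {mean_scaled_norm(d):.3f}")
```

The volume actually grows a little at first (it peaks at d = 5) and only then collapses toward zero, which is part of why low-dimensional intuition is misleading here.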

Dimension reduction: if you have some vectors in a (very) high-dimensional space, you can decrease the number of dimensions and still distinguish between them (very different from what happens when you reduce from 3 dimensions to 2 dimensions).
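
One standard way to do this is a random projection (the Johnson-Lindenstrauss idea). I am not certain this is exactly the method Hopcroft had in mind, so take the following as a sketch under that assumption. The sizes (1000 dimensions down to 200), the seeds and the function names are all my own illustrative choices.

```python
import math
import random

def random_projection(points, k, seed=0):
    """Project d-dimensional points to k dimensions with a random Gaussian matrix."""
    rng = random.Random(seed)
    d = len(points[0])
    # k x d matrix of N(0, 1) entries, scaled by 1/sqrt(k) so that
    # squared lengths are preserved on average.
    matrix = [[rng.gauss(0.0, 1.0) / math.sqrt(k) for _ in range(d)]
              for _ in range(k)]
    return [[sum(r * x for r, x in zip(row, p)) for row in matrix]
            for p in points]

def distance(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(1)
points = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(5)]
projected = random_projection(points, k=200)

# Ratio of projected distance to original distance for every pair of points.
ratios = [distance(projected[i], projected[j]) / distance(points[i], points[j])
          for i in range(5) for j in range(i + 1, 5)]
for r in ratios:
    print(f"distance ratio after projection: {r:.3f}")
```

The ratios should all come out reasonably close to 1: the points lose 80% of their coordinates but keep roughly the same distances to each other, so they remain distinguishable.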

SVM: one of the breakthroughs in machine learning (and Artificial Intelligence; unfortunately he didn’t have more time to continue on this topic). This could be seen as the opposite of the previous one: the possibility of increasing the number of dimensions of the vectors so that it is easier to classify (learn) the solutions of problems. For example, try to draw a straight line that separates the “x” from the “+”.

x + x x  ++++  x x + x

Now the mathematical magic comes in and increases the number of dimensions. Try again!

+         ++++         +

x     x x              x x     x

Of course, it is not as simple as that; there is a lot hidden in the mathematical magic part. It is important to notice that computers are better at learning straight lines than curves.
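
To make the "increase the dimensions" step a bit less magical, here is a deliberately simplified sketch. It is not a real SVM (there is no margin maximization or kernel), and it does not reproduce the exact layout drawn above. It uses a made-up 1-D data set where the “+” points sit in the middle, and a hypothetical lifting map x -> (x, x²) that I picked for illustration.

```python
def lift(x):
    """Map a 1-D point to 2-D by adding its square as a second coordinate."""
    return (x, x * x)

def separable_by_threshold(labelled_values):
    """True if a single threshold puts all '+' on one side and all 'x' on the other."""
    plus = [v for v, label in labelled_values if label == "+"]
    cross = [v for v, label in labelled_values if label == "x"]
    return max(plus) < min(cross) or max(cross) < min(plus)

# Made-up data: 'x' on the outside, '+' in the middle.
points = [(-3.0, "x"), (-2.5, "x"), (-1.0, "+"), (-0.5, "+"),
          (0.5, "+"), (1.0, "+"), (2.5, "x"), (3.0, "x")]

# In 1-D no single cut separates the two classes.
print("separable in 1-D:", separable_by_threshold(points))

# After lifting, look only at the new coordinate (the height x^2).
# A threshold on the height is a horizontal straight line in the 2-D plane.
heights = [(lift(v)[1], label) for v, label in points]
print("separable after lifting:", separable_by_threshold(heights))
```

In the lifted plane every “+” has height at most 1 and every “x” has height at least 6.25, so any horizontal line in between separates them. The curve that would be needed in the original space has become a straight line in the bigger one.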

Sparse vectors: in nature, if you have a huge vector representing, say, a genome, most of the values (genes) are 0. Even more, the sparse solution is unique. This is why geneticists are able to crack DNA. This also reminds me that in nature, interactions of 3 or more factors are rare. This is great for statistics, where we look for low-order interactions or no interactions at all (independent variables).

Probability and statistics: no need to explain what this is. But most of the big advances in the areas that are producing big information are going to require more than means and standard deviations. It took almost a year to generate enough collisions to be able to confirm the Higgs boson at 5 sigma, a measure of statistical significance.
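
To give an idea of how demanding 5 sigma is, the sketch below converts a number of sigmas into the probability that pure chance produces a fluctuation at least that large under a normal distribution. It assumes the one-sided tail, which is, as far as I understand, the convention in particle physics; the function name is my own.

```python
import math

def one_sided_p_value(sigmas):
    """Upper-tail probability of a standard normal beyond `sigmas` standard deviations."""
    return 0.5 * math.erfc(sigmas / math.sqrt(2))

for s in (1, 2, 3, 5):
    p = one_sided_p_value(s)
    print(f"{s} sigma -> p = {p:.3e} (about 1 in {1 / p:,.0f})")
```

At 5 sigma the chance of a fluke is roughly 1 in 3.5 million, which is why so many collisions were needed before anyone dared to announce the result.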

To infinity and beyond: well, basically calculus. A good understanding of certain concepts such as limits, derivatives and singularities.

Others: Markov chains (random walks), generative models for producing graphs (the “rich get richer” concept), giant components, ranking and voting (and their problems), boosting, zero-knowledge proofs, and sampling (accepting that you cannot store or process all the information, how can you sample to get the right answer?).
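
For the sampling question, one well-known answer is reservoir sampling, which keeps a uniform random sample of fixed size from a stream without ever storing the whole stream. I don't have in my notes whether Hopcroft mentioned this particular algorithm, so treat it as my own illustration; the function name and parameters are assumptions.

```python
import random

def reservoir_sample(stream, k, seed=None):
    """Return k items chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i (0-based) replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir

# The stream is consumed one item at a time; only k items are held in memory.
sample = reservoir_sample(range(100000), k=10, seed=42)
print(sample)
```

The point is that the memory used depends only on the sample size, not on how much data flows past.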

In general, the good news is that if you have a huge amount of information and you lose some of it, you are still able to find all the answers because the structure and properties don’t change. This reminds me of “The Library of Babel” by Jorge Luis Borges.

Hopcroft points out seven problems in the new directions of computer science:

1. Track ideas in scientific literature (a machine that tells you the key articles on a particular topic)
2. Evolution of communities in social networks
3. Extract information from unstructured data sources
4. Processing massive data sets and streams
5. Detecting signals from noise
6. Dealing with high dimensional data
7. Much more application oriented…

It sounds a bit like Cultureplex, doesn’t it?

“The information age is a fundamental revolution that is changing all aspects of our lives. Those individuals, institutions and nations who recognize this change and position themselves for the future will benefit enormously” John E. Hopcroft.