Using MongoDB for great science, part 2.

As you may remember, some months ago I decided to use MongoDB for my master's project, and had a few rather large problems with it. After I posted the article, many people were quick to point out that I was using a development version of MongoDB, so I chalked much of the post up to my own error.

Having found the flexibility of its schemaless design very convenient, I decided to give it one more chance, and (after upgrading to the 64-bit version) I continued to gather data.

One thing I noticed as I was gathering data was that it got slower and slower, until writes took about 15 seconds each. What’s more, it seemed to be using lots of memory. I went into IRC to ask what might be wrong, and I was informed that this was because I had five indexes.

Anyone familiar with a relational database will surely wonder why five indexes is anything to sweat about, but not so here. Apparently, having five indexes on three million documents (each document having two lists of 10 items each, on average) makes it consume too much memory. I asked how much memory I needed to give it to process my indexes comfortably, and the answer was “about 9 GB”.

Not being made of money, I decided to delete the indexes instead, and performance immediately picked up and memory usage dropped (I wasn’t using the indexes much anyway). This allowed my data gathering to complete and I started to think about performing the next stage of computations.

My data is a large graph of three million nodes and about thirty million edges, and I needed to make sure it was undirected (i.e. every edge from node A to node B should also have a corresponding edge from B to A, in my representation). To do this, I wrote a script to get each node and add the edges that don’t exist. Not a very fast operation, but hey, how long can it take?
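For illustration, here is a minimal in-memory sketch of that symmetrising step. It is not the actual script, which ran against MongoDB; the `make_undirected` name and the dict-of-sets representation are assumptions made up for this example.

```python
from collections import defaultdict


def make_undirected(adjacency):
    """Return a copy of the graph where every edge A -> B also has B -> A.

    `adjacency` maps each node to the set of nodes it points to
    (a hypothetical representation chosen for this sketch).
    """
    symmetric = defaultdict(set)
    for node, neighbours in adjacency.items():
        # Keep the existing outgoing edges (this also registers isolated nodes).
        symmetric[node].update(neighbours)
        # Add the reverse edge for every outgoing edge.
        for neighbour in neighbours:
            symmetric[neighbour].add(node)
    return dict(symmetric)


# Tiny example: C has no edge back to A until the graph is symmetrised.
graph = {"A": {"B", "C"}, "B": {"A"}, "C": set()}
undirected = make_undirected(graph)
```

Using sets means an edge that already exists in both directions is not duplicated, so running the function twice gives the same result as running it once.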

About fifteen days, apparently. MongoDB could only process about 200,000 items a day, and that’s with no indexes at all. Still, I left it, thinking speed would pick up.
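The fifteen-day figure follows from simple division, assuming the "items" being processed are the three million nodes (the variable names below are just for this sketch):

```python
# Back-of-the-envelope estimate, assuming one "item" is one graph node.
total_nodes = 3_000_000    # nodes in the graph
nodes_per_day = 200_000    # observed processing rate

days_needed = total_nodes / nodes_per_day
print(days_needed)  # 15.0
```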

What picked up instead was memory usage, and my OS killed the process because it ran out of memory. When I started it up again, it recommended a repair, so I ran the repair and then launched it.

I then went to query something and noticed something odd. After a bit of poking around, it turned out that half the data in the database was missing.

Luckily, knowing what I was dealing with, I had made a backup two days earlier. It means I’ve lost two days of work, but at least it’s not worse. I am restoring from the backup right now and moving to another database (PostgreSQL or SQLite) as soon as I can.

(EDIT: It turns out that the database had been silently corrupted days earlier, and all the backups had data missing. I had to restore from the original backup and reprocess everything.)

It goes without saying, of course, that I will not be touching MongoDB again with a ten-foot pole.

Stavros' Stuff

Conceived on Jun 3, 2010

Stavros
