Tuesday, June 25, 2013

Excel VLookup

Excel VLOOKUP can be used to look up a key and return its corresponding value. It is very convenient when you quickly need to show a description from a set of values.

Formula: =VLOOKUP(value to look up, range to search, column number of the value to return, TRUE = approximate match / FALSE = exact match)
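
For example, assuming (hypothetically) that Sheet2 has product codes in column A and their descriptions in column B, this formula returns the description for the code in A2:

=VLOOKUP(A2, Sheet2!A:B, 2, FALSE)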

Also, to keep the range from changing when you copy your formula to other cells, you will need to enter the range as an Absolute Reference. By that I mean the range will be absolute, or will not change, when you extend the formula from the first cell to the rest.

To make the range Absolute, put a Dollar Sign ($) in front of both the column letter and the row number in the cell range.  For example, if the cell range is D1:F100 (columns D, E, and F, from row 1 through row 100), enter it in your formula as $D$1:$F$100.  Now when you copy the formula down to subsequent rows, the range remains constant, or Absolute, in each of the copied formulas.
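
For example, assuming the lookup table sits in D1:F100 and the keys are in column A (a hypothetical layout), this formula can be copied down column B and the table range will not shift:

=VLOOKUP(A1, $D$1:$F$100, 2, FALSE)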


Monday, June 3, 2013

Java HashMap with millions and millions of records

I am just trying to find an answer to this problem: if you have millions and millions of records stored in a map, you are sure to run out of memory. Here are some alternatives:

MapDB
JDBM2
Berkeley DB (serialization overhead)
HSQLDB 

You can also use efficient hashing algorithms, like:

Buzz Hash
http://www.java2s.com/Code/Java/Development-Class/AveryefficientjavahashalgorithmbasedontheBuzHashalgoritm.htm

Hash History
http://www.serve.net/buz/hash.adt/java.000.html


Some basics on HashMap

An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
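
For example (the entry count below is just an illustrative figure and 0.75f is the HashMap default), you can size the map up front so it never needs to rehash while you load it:

import java.util.HashMap;
import java.util.Map;

// Sizing the map up front avoids repeated rehashing while loading many entries.
public class PresizedMap {
    public static void main(String[] args) {
        int expectedEntries = 26_000_000;   // example figure
        float loadFactor = 0.75f;           // the HashMap default
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);
        Map<String, String> map = new HashMap<String, String>(initialCapacity, loadFactor);
        System.out.println("created map with initial capacity " + initialCapacity);
    }
}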

The problem is that 1.7 GB of raw text is more than 1500 MB even without the overhead added by the individual String objects. For huge mappings you should use either a database or a file-backed Map; these use disk storage instead of heap memory.


JDBM2 is exactly what you are asking for. It provides a HashMap backed by disk storage (among other maps). It's fast, thread-safe, and the API is really simple.

MapDB provides concurrent TreeMap and HashMap backed by disk storage or off-heap-memory. It is a fast, scalable and easy to use embedded Java database engine. It is tiny (~250KB jar), yet packed with features such as transactions, space efficient serialization, instance cache and transparent compression/encryption. It also has outstanding performance rivaled only by native embedded db engines.
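
As a minimal sketch (the exact builder methods differ between MapDB versions, so treat the calls below as an approximation of the 1.x-style API):

import java.io.File;
import java.util.Map;
import org.mapdb.DB;
import org.mapdb.DBMaker;

// Sketch of a disk-backed map using a MapDB 1.x-style API (method names may
// differ in other MapDB versions).
public class DiskBackedMapExample {
    public static void main(String[] args) {
        DB db = DBMaker.newFileDB(new File("bigmap.db"))
                       .closeOnJvmShutdown()
                       .make();
        Map<String, String> map = db.getHashMap("records");  // stored on disk, not heap
        map.put("key1", "value1");
        db.commit();   // flush changes to disk
        db.close();
    }
}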


  1. As discussed at length in other posts, with a good hashCode(), 26M entries in a Map is no big deal.
  2. However, a potentially hidden issue here is GC impact of giant maps.
I'm assuming that these maps are long-lived, i.e. you populate them and they stick around for the duration of the app. I'm also assuming the app itself is long-lived -- like a server of some sort.
Each entry in a Java HashMap requires three objects: the key, the value, and the Entry that ties them together. So 26M entries in the map means 26M * 3 == 78M objects. This is fine until you hit a full GC. Then you've got a pause-the-world problem: the GC will look at each of the 78M objects and determine they're all alive, and 78M+ objects is just a lot of objects to look at.

If your app can tolerate occasional long (perhaps many seconds) pauses, there's no issue. If you're trying to achieve any latency guarantees you could have a major issue (of course, if you want latency guarantees, Java's not the platform to choose :)). If the values in your maps churn quickly, you can end up with frequent full collects, which compounds the problem greatly.
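
As a rough, back-of-envelope illustration of that per-entry overhead (one million entries and String values are just example choices; the numbers will vary by JVM), a sketch like this can give an approximate figure on your own machine:

import java.util.HashMap;
import java.util.Map;

// Rough check of per-entry overhead: fill a map and compare heap use before/after.
public class MapOverhead {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        System.gc();
        long before = rt.totalMemory() - rt.freeMemory();

        Map<Integer, String> map = new HashMap<Integer, String>();
        int n = 1_000_000;
        for (int i = 0; i < n; i++) {
            map.put(i, "value" + i);
        }

        System.gc();
        long after = rt.totalMemory() - rt.freeMemory();
        System.out.println(map.size() + " entries, approx bytes per entry: "
                + (after - before) / n);
    }
}
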
I don't know of a great solution to this issue. Ideas:
  • It's sometimes possible to tune GC and heap sizes to "mostly" prevent full GCs.
  • If your map contents churn a lot you could try Javolution's FastMap -- it can pool Entry objects, which could lower the frequency of full collects
  • You could create your own map implementation and do explicit memory management on byte[] (i.e. trade CPU for more predictable latency by serializing millions of objects into a single byte[] -- ugh!); a small sketch of this idea follows at the end of this post
  • Don't use Java for this part -- talk to some sort of predictable in-memory DB over a socket
  • Hope that the new G1 collector will help (mainly applies to the high-churn case)
Just some thoughts from someone who has spent a lot of time with giant maps in Java.
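
Here is a tiny, hypothetical sketch of the byte[] idea from the list above: fixed-width records packed into one large array, so the GC sees a single object instead of millions of small ones (index-based rather than hashed, purely to show the idea):

import java.nio.ByteBuffer;

// Hypothetical sketch: pack fixed-width (8-byte) values into one large byte[]
// so the GC has a single array to scan instead of millions of small objects.
public class PackedLongStore {
    private static final int SLOT = 8;   // bytes per long value
    private final byte[] data;           // one big, GC-friendly array

    public PackedLongStore(int capacity) {
        data = new byte[capacity * SLOT];
    }

    public void put(int index, long value) {
        ByteBuffer.wrap(data, index * SLOT, SLOT).putLong(value);
    }

    public long get(int index) {
        return ByteBuffer.wrap(data, index * SLOT, SLOT).getLong();
    }

    public static void main(String[] args) {
        PackedLongStore store = new PackedLongStore(1_000_000);
        store.put(42, 123456789L);
        System.out.println(store.get(42));   // prints 123456789
    }
}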