I am just trying to find an answer to this problem. If you have millions and millions of records stored in a map, you are sure to run out of memory. Here are some alternatives:
- MapDB
- JDBM2
- Berkeley DB (serialization overhead)
- HSQLDB
You can also use efficient hashing algorithms such as the following (a minimal sketch follows the links):
- Buzz Hash: http://www.java2s.com/Code/Java/Development-Class/AveryefficientjavahashalgorithmbasedontheBuzHashalgoritm.htm
- Hash History: http://www.serve.net/buz/hash.adt/java.000.html
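The core idea behind a Buz-style hash is simply a rotate-and-XOR against a table of 256 random 64-bit values, one lookup per input byte. Here is a minimal sketch of that idea, not a copy of the linked implementations; the class name, seed, and table are purely illustrative:

```java
import java.util.Random;

// Minimal Buz-style hash sketch: rotate the running hash and XOR in a
// random table entry selected by each input byte.
public final class BuzHashSketch {
    private static final long[] TABLE = new long[256];
    static {
        Random rnd = new Random(42L);            // fixed seed: illustrative, keeps hashes stable per run
        for (int i = 0; i < TABLE.length; i++) {
            TABLE[i] = rnd.nextLong();
        }
    }

    public static long hash(byte[] input) {
        long h = 0L;
        for (byte b : input) {
            h = Long.rotateLeft(h, 1) ^ TABLE[b & 0xFF];   // rotate, then mix in the byte
        }
        return h;
    }
}
```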
Some basics on HashMap
An instance of HashMap has two parameters that affect its performance:
initial capacity and load factor. The capacity is the number of buckets
in the hash table, and the initial capacity is simply the capacity at
the time the hash table is created. The load factor is a measure of how
full the hash table is allowed to get before its capacity is
automatically increased. When the number of entries in the hash table
exceeds the product of the load factor and the current capacity, the
hash table is rehashed (that is, internal data structures are rebuilt)
so that the hash table has approximately twice the number of buckets.
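Because rehashing rebuilds the whole table, it pays to pre-size a HashMap when you already know roughly how many entries it will hold. A small sketch (the 26M figure echoes the entry count discussed further down; the key/value types are just placeholders):

```java
import java.util.HashMap;
import java.util.Map;

public class PresizedMapExample {
    public static void main(String[] args) {
        // Choose capacity so that capacity * loadFactor >= expected entries,
        // which avoids repeated rehashing while the map is being populated.
        int expectedEntries = 26_000_000;         // placeholder: entry count discussed below
        float loadFactor = 0.75f;                 // HashMap's default load factor
        int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);

        Map<Long, String> map = new HashMap<>(initialCapacity, loadFactor);
        System.out.println("Map sized for roughly " + expectedEntries + " entries");
    }
}
```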
Your problem is that 1.7 GB of raw text is more than 1500 MB even without the overhead added by the individual String objects. For huge mappings you should either use a database or a file-backed Map; these use disk storage instead of heap.
JDBM2 is exactly what you are asking for. It provides a HashMap backed by disk storage (among other maps). It's fast, thread-safe, and the API is really simple.
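A minimal sketch of the kind of usage JDBM2's documentation shows; the file name and map name are placeholders, and the exact classes (RecordManagerFactory, PrimaryHashMap) should be verified against the JDBM2 release you actually use:

```java
import jdbm.PrimaryHashMap;
import jdbm.RecordManager;
import jdbm.RecordManagerFactory;

public class Jdbm2Sketch {
    public static void main(String[] args) throws Exception {
        // Opens (or creates) the store backed by files named "recordsDb.*".
        RecordManager recMan = RecordManagerFactory.createRecordManager("recordsDb");

        // A HashMap whose entries live on disk rather than on the heap.
        PrimaryHashMap<Long, String> records = recMan.hashMap("records");

        records.put(1L, "first record");
        recMan.commit();                          // flush changes to disk

        System.out.println(records.get(1L));
        recMan.close();
    }
}
```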
MapDB provides concurrent TreeMap and HashMap backed by disk storage or
off-heap-memory. It is a fast, scalable and easy to use embedded Java
database engine. It is tiny (~250KB jar), yet packed with features such
as transactions, space efficient serialization, instance cache and
transparent compression/encryption. It also has outstanding performance
rivaled only by native embedded db engines.
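A similar sketch for MapDB, written against the older 1.x-style DBMaker builder; later MapDB releases changed this API, so treat the method names here as assumptions to check for your version:

```java
import org.mapdb.DB;
import org.mapdb.DBMaker;

import java.io.File;
import java.util.concurrent.ConcurrentMap;

public class MapDbSketch {
    public static void main(String[] args) {
        // File-backed store: the map contents live on disk, not on the heap.
        DB db = DBMaker.newFileDB(new File("mapdb.db"))
                       .closeOnJvmShutdown()
                       .make();

        ConcurrentMap<Long, String> records = db.getHashMap("records");
        records.put(1L, "first record");
        db.commit();                              // persist changes

        System.out.println(records.get(1L));
        db.close();
    }
}
```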
- As discussed at length in other posts, with a good hashCode(), 26M entries in a Map is no big deal.
- However, a potentially hidden issue here is the GC impact of giant maps.
I'm assuming that these maps are long-lived, i.e. you populate them and they stick around for the duration of the app. I'm also assuming the app itself is long-lived -- like a server of some sort.
Each entry in a Java HashMap requires three objects: the key, the
value, and the Entry that ties them together. So 26M entries in the map
means 26M * 3 == 78M objects. This is fine until you hit a full GC.
Then you've got a pause-the-world problem. The GC will look at each of
the 78M objects and determine they're all alive. 78M+ objects is just a
lot of objects to look at. If your app can tolerate occasional long
(perhaps many seconds) pauses, there's no issue. If you're trying to achieve any latency guarantees, you could have a major issue (of course, if you want latency guarantees, Java's not the platform to choose :)). If the values in your maps churn quickly, you can end up with frequent full collects, which compounds the problem greatly.
I don't know of a great solution to this issue. Ideas:
- It's sometimes possible to tune GC and heap sizes to "mostly" prevent full GCs.
- If your map contents churn a lot you could try Javolution's FastMap -- it can pool Entry objects, which could lower the frequency of full collects
- You could create your own map impl and do explicit memory management on byte[] (i.e. trade CPU for more predictable latency by serializing millions of objects into a single byte[] -- ugh!); a rough sketch of that idea follows this list
- Don't use Java for this part -- talk to some sort of predictable in-memory DB over a socket
- Hope that the new G1 collector will help (mainly applies to the high-churn case)
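To make the byte[] idea above concrete, here is a rough sketch under some loud assumptions: the class and field names are made up, values are plain Strings encoded as UTF-8, and the index still uses a regular HashMap (a real version would pack the keys too, e.g. with a primitive-keyed map). The point is only that the values collapse into one large array the GC can skip over cheaply:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: values are serialized into a single growing byte[]
// so the GC sees one big array instead of millions of small value objects.
public class PackedStringStore {
    private byte[] data = new byte[1 << 20];                    // backing buffer, grows as needed
    private int used = 0;                                       // bytes written so far
    private final Map<Integer, long[]> index = new HashMap<>(); // key -> {offset, length}

    public void put(int key, String value) {
        byte[] bytes = value.getBytes(StandardCharsets.UTF_8);
        ensureCapacity(used + bytes.length);
        System.arraycopy(bytes, 0, data, used, bytes.length);
        // Note: overwriting a key appends new bytes and strands the old slot (fine for a sketch).
        index.put(key, new long[] { used, bytes.length });
        used += bytes.length;
    }

    public String get(int key) {
        long[] slot = index.get(key);
        if (slot == null) return null;
        return new String(data, (int) slot[0], (int) slot[1], StandardCharsets.UTF_8);
    }

    private void ensureCapacity(int needed) {
        if (needed > data.length) {
            byte[] bigger = new byte[Math.max(needed, data.length * 2)];
            System.arraycopy(data, 0, bigger, 0, used);
            data = bigger;
        }
    }
}
```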
Just some thoughts from someone who has spent a lot of time with giant maps in Java.