- MapDB
- JDBM2
- Berkeley DB (serialization overhead)
- HSQLDB
You can also use efficient hashing algorithms such as (a small sketch of a BuzHash-style hash follows this list):
- BuzHash: http://www.java2s.com/Code/Java/Development-Class/AveryefficientjavahashalgorithmbasedontheBuzHashalgoritm.htm
- Hash History: http://www.serve.net/buz/hash.adt/java.000.html
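For illustration, here is a minimal sketch of a BuzHash-style rolling hash in Java: a fixed table of 256 pseudo-random values combined with a one-bit rotation per input byte. The class name, table seed, and use of java.util.Random are assumptions for the sketch; the implementations linked above differ in their details.

    import java.util.Random;

    /** Minimal sketch of a BuzHash-style hash (rotate-and-XOR over a lookup table). */
    public final class BuzHashSketch {

        // 256 fixed pseudo-random values, one per possible byte value.
        // The fixed (arbitrary) seed keeps hashes stable across runs.
        private static final long[] TABLE = new long[256];
        static {
            Random rnd = new Random(42L);
            for (int i = 0; i < TABLE.length; i++) {
                TABLE[i] = rnd.nextLong();
            }
        }

        /** Rotates the running hash by one bit and XORs in the table entry for each byte. */
        public static long hash(byte[] data) {
            long h = 0L;
            for (byte b : data) {
                h = Long.rotateLeft(h, 1) ^ TABLE[b & 0xFF];
            }
            return h;
        }
    }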
Some basics on HashMap
An instance of HashMap has two parameters that affect its performance: initial capacity and load factor. The capacity is the number of buckets in the hash table, and the initial capacity is simply the capacity at the time the hash table is created. The load factor is a measure of how full the hash table is allowed to get before its capacity is automatically increased. When the number of entries in the hash table exceeds the product of the load factor and the current capacity, the hash table is rehashed (that is, internal data structures are rebuilt) so that the hash table has approximately twice the number of buckets.
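Because rehashing 26M entries is expensive, it usually pays to size the map up front so that the number of entries never exceeds capacity * load factor. A minimal sketch, where the key/value types and the 26M figure are just placeholders taken from the question:

    import java.util.HashMap;
    import java.util.Map;

    public class PresizedMapExample {
        public static void main(String[] args) {
            int expectedEntries = 26000000;
            float loadFactor = 0.75f;   // the HashMap default
            // Pick a capacity large enough that capacity * loadFactor >= expectedEntries,
            // so the table is never rehashed while it is being filled.
            int initialCapacity = (int) Math.ceil(expectedEntries / loadFactor);

            Map<Long, String> map = new HashMap<Long, String>(initialCapacity, loadFactor);
            map.put(42L, "value");
            System.out.println(map.size());
        }
    }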
Your problem is that 1.7 GB of raw text is more than 1500 MB even before the overhead added by the individual String objects. For huge mappings you should use either a database or a file-backed map; these keep the data on disk instead of on the heap.
JDBM2 is exactly what you are asking for. It provides a HashMap backed by disk storage (among other maps). It's fast, thread-safe, and the API is really simple.
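A minimal sketch of typical JDBM2 usage, assuming the RecordManager/hashMap API from its documentation; the store name "wordIndex" and map name "offsets" are placeholders, and exact class or method names may differ between releases:

    import java.io.IOException;
    import java.util.Map;

    import jdbm.RecordManager;
    import jdbm.RecordManagerFactory;

    public class Jdbm2Sketch {
        public static void main(String[] args) throws IOException {
            // Opens (or creates) store files named "wordIndex.*" in the working directory.
            RecordManager recMan = RecordManagerFactory.createRecordManager("wordIndex");

            // Disk-backed map: entries live in the store file, not on the Java heap.
            Map<String, Long> map = recMan.hashMap("offsets");

            map.put("example", 42L);
            recMan.commit();   // flush changes to disk
            recMan.close();
        }
    }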
MapDB provides a concurrent TreeMap and HashMap backed by disk storage or off-heap memory. It is a fast, scalable, and easy-to-use embedded Java database engine. It is tiny (~250 KB jar), yet packed with features such as transactions, space-efficient serialization, an instance cache, and transparent compression/encryption. It also has outstanding performance rivaled only by native embedded DB engines.
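A minimal sketch of the same idea with MapDB, based on the DBMaker builder from the 1.x API; the file and map names are placeholders, and later MapDB versions renamed some of these builder methods:

    import java.io.File;
    import java.util.Map;

    import org.mapdb.DB;
    import org.mapdb.DBMaker;

    public class MapDbSketch {
        public static void main(String[] args) {
            // File-backed store: the data lives on disk, not on the Java heap.
            DB db = DBMaker.newFileDB(new File("wordIndex.mapdb"))
                           .closeOnJvmShutdown()
                           .make();

            Map<String, Long> map = db.getHashMap("offsets");

            map.put("example", 42L);
            db.commit();   // persist the changes
            db.close();
        }
    }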
- As discussed at length in other posts, with a good hashCode(), 26M entries in a Map is no big deal.
- However, a potentially hidden issue here is the GC impact of giant maps.
Each entry in a Java HashMap requires three objects: the key, the value, and the Entry that ties them together. So 26M entries in the map means 26M * 3 == 78M objects. This is fine until you hit a full GC; then you've got a stop-the-world problem. The GC will look at each of the 78M objects and determine that they're all alive, and 78M+ objects is simply a lot of objects to look at. If your app can tolerate occasional long (perhaps many-second) pauses, there's no issue. If you're trying to achieve any latency guarantees, you could have a major issue (of course, if you want latency guarantees, Java isn't the platform to choose :)). If the values in your maps churn quickly, you can end up with frequent full collections, which compounds the problem greatly.
I don't know of a great solution to this issue. Ideas:
- It's sometimes possible to tune GC and heap sizes to "mostly" prevent full GCs.
- If your map contents churn a lot you could try Javolution's FastMap -- it can pool Entry objects, which could lower the frequency of full collections
- You could create your own map implementation and do explicit memory management on byte[] (i.e. trade CPU for more predictable latency by serializing millions of objects into a single byte[] -- ugh!); see the sketch after this list
- Don't use Java for this part -- talk to some sort of predictable in-memory DB over a socket
- Hope that the new G1 collector will help (mainly applies to the high-churn case)
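To make the byte[] idea above concrete, here is a minimal sketch that packs fixed-width records into a single array, so the GC sees one large object instead of tens of millions of small ones. The class name and the fixed-record-width assumption are mine; a real implementation would also need its own key index (for example, open addressing over primitive keys):

    import java.util.Arrays;

    /** Sketch: fixed-width records packed back-to-back into one big byte[]. */
    public final class PackedRecordStore {
        private final int recordSize;
        private byte[] data;
        private int count;

        public PackedRecordStore(int recordSize, int initialCapacity) {
            this.recordSize = recordSize;
            this.data = new byte[recordSize * initialCapacity];
        }

        /** Appends a record (exactly recordSize bytes) and returns its index. */
        public int add(byte[] record) {
            if (record.length != recordSize) {
                throw new IllegalArgumentException("expected " + recordSize + " bytes");
            }
            int offset = count * recordSize;
            if (offset + recordSize > data.length) {
                // Grow the single backing array; still only one object for the GC to track.
                data = Arrays.copyOf(data, Math.max(data.length * 2, offset + recordSize));
            }
            System.arraycopy(record, 0, data, offset, recordSize);
            return count++;
        }

        /** Copies the record at the given index into a fresh array. */
        public byte[] get(int index) {
            byte[] out = new byte[recordSize];
            System.arraycopy(data, index * recordSize, out, 0, recordSize);
            return out;
        }
    }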