LDPC codes have recently been considered for applications such as high-speed satellite and optical communications, hard disk drives, and high-density flash-memory-based storage systems, all of which require codes that are free of error floors down to extremely low bit error rates (BERs). FPGAs are commonly used to evaluate the error performance of such codes. However, existing FPGA-based LDPC decoders do not exploit the configurability and read-first mode of the embedded memory in FPGAs, which limits both their throughput and the code sizes they can support. Four optimization techniques, namely vectorization, folding, message relocation, and circulant permutation matrix sharing, are proposed to improve the throughput, scalability, and efficiency of FPGA-based decoders. Using these techniques, codes are shown to have no error floor down to a BER of 10^-14. It remains very difficult, however, to construct codes that exhibit no error floor down to a BER of about 10^-15. To address this, a backtracking-based reconfigurable decoder that requires no detailed knowledge of the dominant trapping sets is designed; it lowers the error floor of a family of structurally compatible quasi-cyclic LDPC codes by one to two orders of magnitude.
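
As background for the quasi-cyclic structure referred to above, the following is a minimal Python sketch of how a quasi-cyclic LDPC parity-check matrix can be expanded from a table of circulant shifts. It illustrates only the code structure, not the decoder architecture or any of the four optimization techniques. The function names, the convention that -1 marks an all-zero block, and the shift values are all assumptions made for illustration and are not taken from this work.

```python
def circulant_permutation(size, shift):
    """Return the size x size identity matrix with every row cyclically
    shifted to the right by `shift` positions."""
    return [[1 if col == (row + shift) % size else 0 for col in range(size)]
            for row in range(size)]


def expand_qc_ldpc(shift_table, size):
    """Expand a table of shifts into a binary parity-check matrix.

    Each non-negative entry becomes a circulant permutation matrix;
    an entry of -1 (an assumed convention) becomes an all-zero block.
    """
    H = []
    for block_row in shift_table:
        blocks = [
            [[0] * size for _ in range(size)] if s < 0
            else circulant_permutation(size, s)
            for s in block_row
        ]
        # Concatenate the blocks of this block row, one matrix row at a time.
        for r in range(size):
            H.append([bit for block in blocks for bit in block[r]])
    return H


# Hypothetical 2 x 3 shift table with 4 x 4 circulants (illustrative values only).
SHIFTS = [[0, 1, -1],
          [2, -1, 3]]
H = expand_qc_ldpc(SHIFTS, 4)

for row in H:
    print("".join(str(bit) for bit in row))
```

Because every block is either a shifted identity or all zeros, the whole matrix is described by the small shift table. This compact description is what makes block-wise storage and processing of quasi-cyclic codes attractive in hardware.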