There have been many studies on automatic tools for mapping CNN models onto FPGAs, but challenges remain in designing an easy-to-use toolflow. First, the toolflow should handle models exported from various deep learning frameworks as well as models with different topologies. Second, the hardware architecture should make efficient use of on-chip resources to achieve high performance. In this work, we build a toolflow upon the Open Neural Network Exchange (ONNX) intermediate representation to support different deep learning frameworks, and we maximize overall throughput through several hardware-level optimizations. We accelerate the convolution operation by exploiting parallelism not only at the input- and output-channel level but also at the output-feature-map level. We also design several on-chip buffers, together with the corresponding management algorithms, to exploit the abundant on-chip memory resources. Moreover, we employ a fully pipelined systolic array running at 400 MHz as the convolution engine and develop a dedicated bus that implements the im2col algorithm and feeds feature inputs to the systolic array. We generate four accelerators with different systolic array shapes and compile 12 CNN models for each of them. Deployed on a Xilinx VCU118 evaluation board, the accelerators reach 3267.61 GOPS on convolutional layers, which is 99.72% of the ideal throughput (3276.8 GOPS), and an overall throughput of up to 2424.73 GOPS. Compared with previous studies, our toolflow is more user-friendly, and the generated accelerators deliver better end-to-end performance than related work at the same DSP utilization.
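As a rough sanity check, the quoted ideal-throughput figure is consistent with a peak-rate accounting of the form below; the array size of 64 × 64 (4096 multiply-accumulate units) is an illustrative assumption and is not stated in this abstract.
\[
P_{\text{ideal}} = 2\,\tfrac{\text{ops}}{\text{MAC}} \times 4096\ \text{MACs} \times 0.4\ \text{GHz} = 3276.8\ \text{GOPS},
\qquad
\frac{3267.61}{3276.8} \approx 99.72\%.
\]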