Do you have a theory about which items belong to which dimensions? What I'm really asking is whether you can look at content evidence for validity: pull out the items that don't fit, based on expert knowledge grounded in theory. You may then have fewer items to look at, and you could actually run a CFA if you have a theory behind the latent trait.
I'd run some IRT analyses on these items and look at the discrimination and difficulty of each item (particularly the item characteristic curves). If an item discriminates poorly, you may want to get rid of it. If it discriminates only at very high or very low ability, you may also want to get rid of it, depending on what you are trying to measure.
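To make the item characteristic curve idea concrete, here is a minimal sketch. It assumes a two-parameter logistic (2PL) model, which is my choice for illustration and may not be the model you end up fitting. The function name `icc_2pl` and the three item parameter pairs are made up, not estimates from your data.

```python
import math

def icc_2pl(theta, a, b):
    """Probability of endorsing/answering correctly under a 2PL model.

    theta: person ability, a: item discrimination, b: item difficulty.
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Hypothetical (a, b) pairs, invented purely for illustration.
items = {
    "sharp_mid": (2.0, 0.0),   # discriminates well around average ability
    "flat": (0.3, 0.0),        # discriminates poorly everywhere
    "sharp_high": (2.0, 2.5),  # discriminates only at high ability
}

thetas = [-3, -2, -1, 0, 1, 2, 3]
for name, (a, b) in items.items():
    curve = [round(icc_2pl(t, a, b), 2) for t in thetas]
    print(f"{name:>10}: {curve}")
```

A flat curve means the item barely separates low from high ability. A steep curve sitting far to one side only tells you something about people at that end of the scale, which may or may not be what you want.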
For your point 3, I don't think I'd use factor analysis for DIF detection. I'd use the Mantel-Haenszel procedure or, better still, compare three nested models: the null (ability only), ability plus gender, and ability plus gender plus the gender x ability interaction. There's debate in the literature about how to compare the models, but I'd look at the difference in chi-square between model 3 and model 1 (the interaction model minus the null). I think the critical value for DIF detection is 5.99, which is the chi-square critical value for 2 degrees of freedom at alpha = .05 (R gives me the p-values for this, so I don't know it off the top of my head). If that test is significant, look at model 2 minus model 1 (that's uniform DIF) and also model 3 minus model 2 (non-uniform DIF).
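Here is a rough sketch of that comparison logic. It assumes the three models are nested and fitted by maximum likelihood (for example, as logistic regressions), so that twice the difference in log-likelihood is a chi-square statistic. The function names and the log-likelihood values are hypothetical. In practice you would plug in the log-likelihoods reported by whatever software you fit the models with.

```python
import math

def chi2_sf(x, df):
    """Upper-tail p-value of a chi-square statistic (closed forms for df 1 and 2)."""
    x = max(x, 0.0)  # guard against tiny negative values from rounding
    if df == 1:
        return math.erfc(math.sqrt(x / 2.0))
    if df == 2:
        return math.exp(-x / 2.0)
    raise ValueError("this sketch only handles df = 1 or 2")

def dif_tests(ll_null, ll_group, ll_interaction, alpha=0.05):
    """Likelihood-ratio DIF tests from the log-likelihoods of models 1, 2 and 3."""
    def lr(ll_small, ll_big, df):
        stat = 2.0 * (ll_big - ll_small)
        return stat, chi2_sf(stat, df)

    # Overall test: model 3 vs. model 1, 2 df (critical value about 5.99).
    overall = lr(ll_null, ll_interaction, 2)
    result = {"overall": overall}
    if overall[1] < alpha:
        # Follow-ups, 1 df each (critical value about 3.84).
        result["uniform"] = lr(ll_null, ll_group, 1)             # model 2 - model 1
        result["non_uniform"] = lr(ll_group, ll_interaction, 1)  # model 3 - model 2
    return result

# Made-up log-likelihoods for one item, for illustration only.
for test, (stat, p) in dif_tests(-512.4, -508.1, -507.9).items():
    print(f"{test:>12}: chi-square = {stat:.2f}, p = {p:.4f}")
```

The 5.99 cutoff falls straight out of the 2 df case: the p-value there is exp(-x/2), so the 5% critical value is -2 ln(0.05), roughly 5.99.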