pytext v0.3.3 Release Notes

Release Date: 2020-06-08
  • 🆕 New features

    • ➕ Add XLM-R document classification server + console (#1358)
    • MLP layer embed for float tensors and FloatListSeqTensorizer for List[List[float]] features. (#1374)
    • ➕ Add class_accuracy in MultiLabelSoftClassificationMetrics (#1371)
    • ➕ Add an option to skip test run after models have been trained (#1372)
    • 👌 Support DP in PyText (#1366)
    • Support torchscriptify in multi_label_classification_layer (#1350)
    • ➕ Add custom metric class for reporting Joint model metrics (#1339)
    • MultiLabel-MultiClass Model for Joint Sequence Tagging (#1335)
    • 👍 Scripted tokenizer support for DocModel (#1314)

    🛠 Bug fixes

    • 🛠 Fixed metric reporter aggregation and output layer for multi-label classification
    • Remove move_state_dict_to_gpu, which is causing CUDA OOM (#1367)
    • 🛠 Fix Flow's default conversion of dict to AttrDict
    • 🛠 Fix bug in ClassificationOutputLayer where pad_idx is never respected (#1347)
    • 🛠 Serializing/Deserializing type Any: bugfix and simplification (#1344)
    • 🛠 Fix RoBERTa Q&A Training Bug with multiple BoS tokens. (#1343)

    Other

    • 👍 Better error message for misconfigured data fields
    • 🗄 Replace deprecated integer division with floor division operator
    • ➕ Add informative prints to assert statements (#1360)
    • TorchScript: Put dense tensor on the same device with other input tensors (#1361)
    • ⚡️ Update PyTorch + ONNX (#1340)
    • ⚡️ Update PyTorch + ONNX (#1340) - binary ONNX
    • ⚡️ Update PR Template (#1349)
    • ⬇️ Reduce memory request for pytext train operator
    • ➕ Add 'contrib' directory for experimental code (#1333)
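
The floor-division item above does not say where the replacement was made, so the following is only a minimal plain-Python sketch of the difference between true division and the floor division operator. The helper name num_batches is hypothetical and is not part of PyText.

```python
# Illustrative only: plain-Python semantics, not code taken from PyText.

def num_batches(num_examples: int, batch_size: int) -> int:
    """Count full batches; floor division keeps the result an integer."""
    return num_examples // batch_size


print(7 / 2)               # true division yields a float: 3.5
print(7 // 2)              # floor division yields an int: 3
print(-7 // 2)             # floors toward negative infinity: -4
print(num_batches(10, 4))  # 2 full batches of 4 fit into 10 examples
```

Note that floor division rounds toward negative infinity rather than toward zero, which matters only when an operand can be negative.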

Previous changes from v0.3.2

  • 🆕 New features

    • ➕ Add Roberta model into BertPairwiseModel (#1336)
    • 👌 Support read file from http URL (#1317)
    • add a new PyText get_num_examples_from_batch function in model (#1319)
    • ➕ Add support for length label smoothing (#1308)
    • ➕ Add new metrics type for Masked Seq2Seq Joint Model (#1304)
    • ➕ Add mask generator and strategy (#1302)
    • ➕ Add separate logging for label loss and length loss (#1294)
    • ➕ Add tensorizer support for masking of target tokens (#1297)
    • ➕ Add length prediction and basic masked generator (#1290)
    • Add self attention option to conv_encoder and conv_decoder (#1291)
    • Entity Saliency modeling on PyText: EntitySalienceMetricReporter/EntitySalienceTask
    • In-batch negative training for BertPairwiseModel
    • 👌 Support embedding from decoder (#1284)
    • ➕ Add dense features to Roberta
    • ➕ Add projection layer to HuggingFace encoder (#1273)
    • ➕ add PyText Embedding TorchScript Wrapper
    • ➕ Add option to pad missing label in LabelListTensorizer (#1269)
    • ↔ Integrate PET and Introduce ElasticTrainer (#1266)
    • 👌 support PoolingType in DocNN. (#1259)
    • ➕ Added WordSeqEmbedding (#1255)
    • Open source Assistant NLU seq2seq model (#1236)
    • 👌 Support multi label classification
    • BART in decoupled model

    🐛 Bug fixes

    • 🛠 Fix Incorrect State Dict Assumption (#1326)
    • 🐛 Bug fix for "RoBERTaTensorizer object has no attribute is_input" (#1334)
    • Cast model output to cpu (#1329)
    • 🛠 Fix OSS predict-py API (#1320)
    • 🛠 Fix "calling median on empty tensor" issue in MR (#1322)
    • ➕ Add ScriptDoNothingTokenizer so that torchscriptification of SPM does not fail (#1316)
    • 🛠 Fix creating generator every time (#1301)
    • 🛠 fix dense feature for fp16
    • 👀 Avoid edge cases with quantization by setting a known seed (#1295)
    • 👉 Make torchscript predictions even on empty text / token inputs
    • 🛠 fix dense feature TorchScript typing (#1281)
    • avoid zero division error in metrics reporter (#1271)
    • 🛠 Fix contiguous issue in bilstm export (#1270)
    • 🛠 fix debug file generation for multilabel classification (#1247)
    • 🛠 Fix fp16 optimizer attribute name
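
The zero-division item above (#1271) gives no detail on how the metrics reporter was changed. As a hedged sketch of the general technique, a ratio metric can guard its denominator and fall back to a default value. The names safe_divide and precision are hypothetical and are not PyText APIs.

```python
# Illustrative only: a generic guard against ZeroDivisionError in ratio metrics.

def safe_divide(numerator: float, denominator: float, default: float = 0.0) -> float:
    """Return numerator / denominator, or `default` when the denominator is zero."""
    return numerator / denominator if denominator else default


def precision(true_positives: int, predicted_positives: int) -> float:
    """Precision that stays defined when nothing was predicted positive."""
    return safe_divide(true_positives, predicted_positives)


print(precision(3, 4))  # ordinary case
print(precision(0, 0))  # empty case returns the default instead of raising
```

Returning 0.0 for an empty denominator is one convention among several; some reporters prefer NaN so that empty classes are visible in the output.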

    Other

    • Simplify contextual embedding dimension computation in PyText (#1331)
    • 🆕 New Debug File for masked seq2seq
    • 🚚 Move MockConfigLoader to OSS (#1324)
    • ⚡️ Pass in optimizer config instead of create_optimizer to trainer
    • ✂ Remove unnecessary torch.no_grad() block (#1323)
    • 🛠 Fix Memory Issues in Metric Reporter for Classification Tasks over large Label Spaces
    • ➕ Add contextual embedding support to OS seq2seq model (#1299)
    • recover xlm_r tutorial notebook (#1305)
    • Enable controlling bias in MLP decoder
    • Migrate serving tutorial to TorchScript (#1310)
    • ✂ delete caffe2 export (#1307)
    • ➕ add whitelist for ONNX export
    • 👉 Use dynamic quantization api for BeamSearch (#1303)
    • ✂ Remove requirement that eos/bos be supplied for sequence export. (#1300)
    • 👍 Multicolumn support
    • 👍 Multicolumn support in torchscriptify
    • ➕ Add caching support to RawExample and batch predict API (#1298)
    • ➕ Add save-pytext-snapshot command to PyText cmdline (#1285)
    • ⚡️ Update with WhatsApp calling data + support dictionary features (#1293)
    • add arrange_caffe2_model_inputs in BaseModel (#1292)
    • ✅ Replace unit-tests on LMModel and FLLanguageModelingTask by LiteLMModel and FLLiteLMTask (#1296)
    • 🔄 changes to make mbart work (#1911)
    • 🖐 handle encoder and decoder embedding
    • ➕ Add tutorial for semantic parsing. (#1288)
    • ➕ Add new fb beam search with fused operator (#1287)
    • 🏗 Move generator builder to constructor so that it can easily be overridden. (#1286)
    • Torchscriptify ELTensorizer (#1282)
    • Torchscript export for Seq2Seq model (#1265)
    • 🔄 Change Seq2Seq model from_config() to a more general api (#1280)
    • add max_seq_len to DocNN TorchScript model (#1279)
    • 👌 support XLM-R model Embedding in TorchScript (#1278)
    • Generic PyText Checkpoint Manager Interface (#1267)
    • 🛠 Fix backward compatibility issue of pad_missing in LabelListTensorizer (#1277)
    • ⚡️ Update mean reduction in NLLLoss (#1272)
    • migrate pages.integrity.scam.docnn_models.xxx (#1275)
    • Unify model input for ByteTokensDocumentModel (#1274)
    • Torchscriptify TokenTensorizer
    • 👍 Allow dictionaries to overwrite entries with #fairseq:overwrite comment (#1073)
    • 👉 Make WordSeqEmbedding ONNX compatible
    • If the snapshot path provided is not valid, throw error (#1268)
    • 👌 support vocab filter by min count
    • Unify input for TorchScript Tensorizers and Models (#1256)
    • Torchscriptify XLM-R
    • ➕ Add class logging to task (#1264)
    • ➕ Add usage logging to exporter (#1262)
    • ➕ Add usage logging across models (#1263)
    • 🌲 Usage logging on data classes (#1261)
    • 👍 GPT2 BPE add lower casing support (#1260)
    • FAISS Embedding Search Space [3/5]
    • Return len of tokens of each sequence in SeqTokenTensorizer (#1254)
    • Vocab Limited Pretrained Embedding [2/5] (#1248)
    • ➕ add Stage.OTHERS and allow TB to print to a separate prefix not in (TRAIN, TEST, EVAL) (#1258)
    • ➕ Add option to skip 2 stage tokenizer and bpe decode sequences in the debug file (#1257)
    • ➕ Add Testcase for Wordpiece Tokenizer (#1249)
    • modify accuracy calculation for multi-label classification (#1244)
    • Enable tests in pytext/config:pytext_all_config_test
    • 🌲 Introduce Class Usage Logging (#1243)
    • 👉 Make PyText compatible with Any type (#1242)
    • 👉 Make dict_embedding Torchscript friendly (#1240)
    • 👌 Support MultipleData for export and kd generation
    • ✂ delete flaky/broken tests (#1238)
    • ➕ Add support for returning start & end indices.